ABSTRACT



This invention provides assistance to a user in accessing 
network attached information sources. In one aspect, the 
invention is a method for intelligently routing a user query 
to information sources relevant to that query, extracting 
relevant data fields from received responses, and intelli- 
gently presenting the extracted data in order of estimated 
relevance. The system of this invention implements one or 
more steps of the method in a centralized or distributed 
manner on one or more network attached computers. 
Further, this invention provides a novel language and imple- 
mentation that facilitates easily written and maintained 
descriptions of information source query and response for- 
mats. 

4 Claims, 5 Drawing Sheets 
METHOD AND SYSTEM USING INFORMATION WRITTEN IN A WRAPPER DESCRIPTION LANGUAGE TO EXECUTE QUERY ON A NETWORK

This application is a continuation of U.S. Ser. No. 08/933,752 filed on Sep. 19, 1997, pending, the contents of which are hereby incorporated by reference, which is a continuation of provisional application Ser. No. 60/025,304 filed Sep. 20, 1996, also incorporated by reference herein.

1. FIELD OF THE INVENTION

The field of this invention relates to information access over networks, and specifically to the automatic location and evaluation of relevant information available over public or private networks from information sources in response to user queries.

2. BACKGROUND

The exponential growth of private intranets and the public Internet has produced a daunting labyrinth of increasingly numerous documents, databases and utilities. Almost any type of information is now available somewhere, but most users cannot find what they seek, and even expert users waste copious time and effort searching for appropriate information sources. One problem is simply the increasingly large number of available information sources that are beyond the comprehension of a single user. A second problem, along with this growth in available information and information sources, is a commensurate growth in software utilities and methods to manage, access, and present this information. Each utility has a different and often unique interface and set of commands and capabilities, and is appropriate for a different set of users and a different set of information types and sources. Thus the sheer diversity of available utilities creates a problem for users comparable to that created by the information explosion. Users are now faced with the twin problems of which tool to use to inquire at which information source.

In the past efforts have been made to provide users with automatic, computer assisted services that can help solve these twin problems of the network revolution. For example, AI researchers have created several prototype software agents that help users with e-mail and netnews filtering (Pattie Maes et al., 1993, Learning interface agents, Proceedings of AAAI-93), agents that assist with World Wide Web browsing (H. Lieberman, 1995, Letizia: An agent that assists web browsing, Proc. 15th Int. Joint Conf. on A.I., pp. 924-929; Robert Armstrong et al., 1992, Webwatcher: A learning apprentice for the world wide web, Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments, pp. 6-12, Stanford University, AAAI Press), agents that schedule meetings (Lisa Dent et al., 1992, A personal learning apprentice, Proc. 10th Nat. Conf. on A.I., pp. 96-103; Pattie Maes, 1994, Agents that reduce work and information overload, Comm. of the ACM 37(7): 31-40, 146; Tom Mitchell et al., 1994, Experience with a learning personal assistant, Comm. of the ACM 37(7): 81-91), and agents that perform internet-related tasks (O. Etzioni et al., 1994, A softbot-based interface to the internet, CACM, 37(7): 72-75). Increasingly, the information such agents need to access is available on the World Wide Web. Unfortunately, even a domain as standardized as the WWW has turned out to pose significant problems for automatic software agents. For one, although Web pages are universally written in Hypertext Markup Language ("HTML"), this language merely defines the format of information display, making no attempt to hint at its meaning or semantic content. Currently, no accepted "semantic markup language" for the Web exists, nor is one likely to be adopted universally. The Internet can be expected to pose even greater problems.

Thus, the advent of intranets, the Internet, and the World Wide Web has posed several fundamental problems for the automatic services or agents designed to assist users to find relevant information. First, no one such service has heretofore provided sufficient additional value to replace the use of a Web browser having access to existing directories and indices such as Yahoo or Lycos. Second, such services have not yet been able to understand and competently parse relevant information from the responses returned from a wide variety of Internet and Web information sources. Third, existing services and agents have not been easy to adapt to the ever-increasing numbers of sources with their ever-changing response formats. This is due to the individualized, hand-coded interface to each Internet service and Web site utilized by existing agents (Yigal Arens et al., 1993, Retrieving and integrating data from multiple information sources, International Journal on Intelligent and Cooperative Information Systems 2(2): 127-158; O. Etzioni et al., 1994, A softbot-based interface to the internet, CACM 37(7): 72-75; B. Krulwich, 1995, Bargain finder agent prototype, Technical report, Anderson Consulting; Alon Y. Levy et al., 1995, Data model and query evaluation in global information systems, Journal of Intelligent Information Systems, Special Issue on Networked Information Discovery and Retrieval 5(2); Mike Perkowitz et al., 1995, Category translation: learning to understand information on the internet, Proc. 15th Int. Joint Conf. on A.I.). Preferably, a service or agent should be able to access a new or changed Internet information source in order to automatically learn how to retrieve relevant information from the source. This would be advantageous even if such a facility were limited to groups of sources with response formats selected according to certain constraining principles.

3. SUMMARY OF THE INVENTION

It is a broad object of this invention to solve these fundamental problems by a method and system that provide a personalized network robot, called a "netbot." A netbot acts as a user's intelligent assistant by tracking available network information sources, knowing the relevant information and features of each particular source, and upon user request determining which sources are relevant to a given query, forwarding the query to the most relevant information sources, understanding the responses returned from each source, and integrating and intelligently presenting the query results to the user.

The netbots of this invention possess several advantages, including the following. First, a netbot returns only the most relevant information to the user. On the one hand, each user query is forwarded only to the primary information sources determined to be most relevant. On the other hand, information source responses are parsed and understood so that only the relevant data items are extracted for user presentation. Duplicate, stale, and irrelevant information items are discarded. Second, a netbot is fast. Since it automatically searches the relevant primary sources in parallel, it can present information as quickly as the fastest primary source returns a response. Despite changing conditions which cause different information sources to fluctuate in speed, a netbot integrator remains as fast as the fastest source. Sources that have no information to return to a query do not slow the user
since the netbot simply ignores them. Third, netbots are easily adapted to the ever-increasing number of network information sources with ever-changing response formats. Netbots utilize a new and novel declarative language for describing information sources. A source description is short and easily understandable, and therefore is easily written and maintained.

Therefore, in one aspect the invention includes a method for efficient access to information sources on a network comprising preferably one or more of the following steps: receiving a user query for information; determining the information sources most relevant to this query; retrieving a description of each information source; formatting the query according to this description in a manner suitable for each information source and transmitting the formatted query to the source; receiving responses from the information sources; for each source, understanding and extracting the relevant data fields according to the retrieved description; and presenting to the user the relevant data from each information source in an intelligent manner ranked by an estimate of its relevance. Advantageously, these steps are performed in parallel to the greatest extent possible. In particular, at least, all queries are transmitted to all relevant information sources in parallel without waiting for intervening responses.

In another aspect the invention comprises a computer system and apparatus for performing one or more steps of the method of this invention. The user has a presentation device attached to a network to which is also attached a plurality of information sources. The presentation device receives user queries and displays netbot responses. Further, the presentation device performs one or more of the steps of the method of this invention. One or more of those steps not performed on this device can advantageously be performed on network attached netbot server computers, which respond to functional requests from the user device. Optionally, the user device can range from a diskless hand-held terminal, to a PC, to a workstation, and so forth.

In a further aspect the invention comprises a new language and language implementation to facilitate the creation and maintenance of descriptions of information sources. Importantly, this language recognizes relevant data fields in responses returned from information sources and is capable of extracting all such fields. In the preferred embodiment this language has an action statement component and a regular expression component. The regular expression component has novel features for creating modular hierarchical descriptions of regular expressions, for binding variables to the correct sub-strings recognized during pattern match to a response of an information source, for performing arbitrary action language statements with multiple variable bindings, and for specifying backtrack free recognition of sub-strings where possible.

4. BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the present invention will become better understood by reference to the accompanying drawings, following description, and appended claims, where:

FIG. 1 illustrates generally a netbot of this invention;

FIGS. 2A-B illustrate exemplary user interfaces of embodiments of the netbot of FIG. 1;

FIG. 3 illustrates exemplary functional components of the netbot of FIG. 1; and

FIG. 4 illustrates alternative hardware embodiments of the netbot of FIG. 1.

5. DETAILED DESCRIPTION

For clarity of disclosure, and not by way of limitation, the detailed description of a netbot of this invention is presented as a method for network information access, as a system or apparatus implemented to perform that method, and as a novel language designed to aid in implementing this method and system.

In the following, first, an overview of these aspects of the invention is presented followed, second, by a detailed discussion of individual components.

5.1. OVERVIEW OF NETBOT ARCHITECTURE

A netbot of this invention comprises software and hardware facilities that function together in one or more network computers to assist a user in access to information in network attached servers (known herein as "information sources"). FIG. 1 generally illustrates the relationship of a netbot to a user and to networked information sources. For example, user 1 accesses user computer 3 through standard interface devices, such as monitor 2. In the course of work, the user needs information from information sources 7, attached to the user computer through various network links, such as network links 4 and 6. Since the information sources are many, the user can benefit from assistance in finding needed information from relevant information sources. This assistance is provided by netbot 5, which maintains an awareness of available information sources and queries them through links 6 on behalf of, or as an agent of, the user. Alternatively, netbot 5 can partly or wholly reside on user computer 3 or be partially or wholly distributed on the network and accessed by the user through link 4.

Groups of sources 7 having similar sorts of information are grouped into information domains. For example, one domain can be that of electronic stores for a particular product; another domain might include Internet indexes containing information on the keyword content of various World Wide Web ("WWW") pages.

In a preferred embodiment, the netbot is composed of three major functional modules: a user interface, an integrator, and an I/O manager. Briefly, the user interface module interacts with the user to receive user queries for information and to format and present information received from the network information sources. Advantageously, the user interface is adapted to the specific information domain being accessed. The integrator module accepts a query from the user interface, selects relevant information sources, formats it for network transmission to each relevant information source, receives responses from these sources, understands these responses, and passes relevant portions of the responses back to the user interface module for display to the user. The I/O manager module performs hardware, operating system, and network specific interfacing for the user interface and integrator modules so that porting a netbot to different hardware platforms, operating systems, or networks requires changes only in this modularized code.

In alternative implementations either or both of the user interface or the I/O manager modules may be absent. For example, the functions of these modules might be already performed by other operating system components. A netbot can provide only one or more of the facilities disclosed without providing others. For example, in some embodiments, a netbot is useful that performs query routing, alone or in combination with relevance ranking. In other embodiments, a netbot can simply format queries and
understand responses. Further, additional modules may be
present in or accessed by a netbot. For example, a learning 
module can be present which provides for a netbot to acquire 
automatically the characteristics of a new network informa- 
tion source. Finally, as is known to those of skill in the art, 
the functions performed by the described modules may be 
divided or grouped in alternative fashions among a greater 
or lesser number of modules. 

In the absence of specific preferences, the processes of
this invention can be implemented in a complete procedural
programming language, such as C, or a complete object-
oriented programming language, such as C++, on the dis-
closed hardware configurations.
The User Interface Module 

In more detail, the user interface module has both impor- 
tant functionality that is common to netbot user interfaces, 
whatever the information domain to which netbots are 
directed, and also has adaptations to the particular informa- 
tion domain of a particular netbot. Turning first to the 
preferable common functions, one such is the ability to 
remember a user's preferences for interacting with a netbot.
Such remembered preferences include, for example,
whether a netbot is to fetch pages in order to calculate their
relevance itself, the number N of relevant information
sources to query, display characteristics, etc. A second such 
common function is incremental display of information 
received from a query. Since each query usually causes a 
netbot to consult many independent information sources, 
results are often received at widely varying times. Immedi- 
ate display of asynchronously received results would cause 
such undesirable effects as screen flicker or disorienting 
rearrangement of already displayed data. Incremental dis-
play reconciles, on the one hand, user desires to view
information quickly with, on the other hand, user desires for
a comprehensive view of all received information, sorted
according to relevance.

The incremental display strategy preferably provides one 
or more windows on the user's screen with several defined 
views of the query satisfaction process along with certain 
common user controls, such as screen buttons, for manipu- 
lating these windows. In one window, the user interface 
module presents lists of the information sources being 
consulted with each source symbolically represented as, for 
example, a network address, an icon, or another compact 
screen representation. Then, as an answer is received from 
a particular source, its associated screen representation 
changes appearance, e.g., by changing intensity, changing 
color, etc. Also displayed is a count of the total number of 
unique information items received currently. Optionally, 
clicking on the screen representation of an information 
source opens a further window with either information about 
this information source, or a display of the responses 
received from it, or access to the information source over the 
network, etc. 

One of the common controls is preferably implemented as 
a common show-me-button that when activated causes dis- 
play of another window in which a list of all currently 
received responses is presented ranked according to their 
estimated relevance to the query. Another common control 
is preferably implemented as a more-button in this latter 
window that when activated, causes re-display of all prior 
data items merged with items newly received since the last 
window display. The newly received items are merged into
the display list in order of relevance, and distinguished from
prior items by, for example, their color or intensity so that 
the user can avoid scanning prior items again. Optionally, 
clicking on a data item opens a still further informational
window giving either the source of the item, a display of the
page containing the item, or access over the network to
the source of the item, etc.
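The incremental display behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical `IncrementalDisplay` buffer and `Item` record (neither name comes from this disclosure): arriving results are only queued, and the screen list changes solely when the user activates the more-button, at which point new items are merged by descending relevance and remain flagged so they can be drawn in a distinguishing color or intensity.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    title: str
    relevance: float
    is_new: bool = True  # drawn highlighted until the next refresh


@dataclass
class IncrementalDisplay:
    """Hypothetical buffer between asynchronous arrivals and the screen."""
    shown: list = field(default_factory=list)    # items already on screen
    pending: list = field(default_factory=list)  # arrived, not yet displayed

    def receive(self, item: Item) -> None:
        # Arrivals are queued only; the screen is not redrawn, which
        # avoids flicker and rearrangement of already displayed data.
        self.pending.append(item)

    def more(self) -> list:
        # Prior items lose their highlight; pending items keep is_new=True.
        for item in self.shown:
            item.is_new = False
        merged = self.shown + self.pending
        merged.sort(key=lambda it: it.relevance, reverse=True)
        self.shown, self.pending = merged, []
        return self.shown
```

A caller would invoke `receive` from whatever code handles network responses and `more` from the button handler; the relevance values themselves are assumed to be supplied by the integrator.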

In addition to such common functions and controls, a
netbot user interface module preferably implements specific
designs formatting, and fields suitable to the information 
domain for which it is designed. For example, a netbot for 
comparison shopping in an information domain of electronic 
stores on the Internet can have a particular interface
presentation containing labeled fields for product name, model, and
price. On the other hand a netbot for access to an information 
domain of online Internet indexes can have an interface with 
fields labeled with elements derived from the query. 

In one netbot embodiment, most functions and modules
reside on network attached servers which a user accesses
remotely. For example, the user may access a netbot over the 
Internet with the World Wide Web protocols utilizing a web 
browser, such as Netscape. In this case the user interface 
builds HTML formatted pages which are transmitted over
the network by the I/O manager. FIG. 2A generally illus-
trates the user display from an example of such an
embodiment, which is further directed to the information 
domain of online, electronic software stores. The netbot 
display of FIG. 2A is divided into three sections. Section 11
is a title section generally indicating that this display has
results from a netbot for shopping, a "ShopBot." A netbot
preferably also has a specific input query screen. Section 12 
presents the list of online stores currently being consulted 
represented by their WWW addresses 15, which are select-
able to provide further information or direct WWW access.
At 16 in section 12, those sources which have already 
returned query results are similarly represented. Section 13 
presents the results received so far formatted in accordance 
with this particular information domain into sections for the
major PC operating systems. Each individual item returned,
for example item 16, is formatted with product name, price, 
and an address for the originating information source. In this 
implementation, information display is controlled with the 
window scrolling and control facilities built into the web
browser. This user interface is implemented as an HTML
formatted page created at a netbot server and transmitted to 
the web browser. 
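As an illustration of a server-built page of this kind, the following sketch groups result tuples by operating system and emits a simple HTML page listing product name, price, and source address. The function name `render_results`, the tuple layout, and the markup chosen are assumptions made for this example only; they are not taken from the disclosed embodiment.

```python
from html import escape


def render_results(title, results):
    """Build an HTML page from (os_name, product, price, source_url) tuples."""
    # Group the items into one section per operating system.
    by_os = {}
    for os_name, product, price, url in results:
        by_os.setdefault(os_name, []).append((product, price, url))

    parts = [f"<html><head><title>{escape(title)}</title></head><body>",
             f"<h1>{escape(title)}</h1>"]
    for os_name in sorted(by_os):
        parts.append(f"<h2>{escape(os_name)}</h2><ul>")
        for product, price, url in by_os[os_name]:
            # Escape all source-supplied text before placing it in markup.
            parts.append(
                f'<li>{escape(product)}: ${price:.2f} '
                f'(<a href="{escape(url, quote=True)}">{escape(url)}</a>)</li>'
            )
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping every field matters here because the text originates from third-party information sources and would otherwise be interpreted as markup by the browser.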

FIG. 2B generally illustrates the user display from another 
example of a web-browser-based user interface
embodiment, which in this case is directed to an informa-
tion domain of WWW indexes or search engines. This
display is also generally divided into three sections. Section 
71 displays a title for the netbot; section 72 displays the 
status of the current search; and section 73 displays the
search results. In more detail, the display of section 71
includes a logo for this netbot, "MC" standing for 
"MetaCrawler," a name chosen since WWW search engines 
are also known as "web crawlers," and controls to access 
certain system level presentation features, such as the
MetaCrawler home page and user feedback pages. The display of
section 72 includes list 74 of the search engines being 
queried identified by their common names, the status of the 
current query in general and at each search engine, and 
common user controls. Generally, pie-chart icon 78 summa-
rizes that 7 of the 8 search engines queried have already
responded to the query. At search engine 75, known as 
"Lycos," the check mark indicates that a response containing 
information items has already been received. At search 
engine 76, known as "Inktomi," the cross mark indicates that
a response without any information items has already been
received. On the other hand, search engine 77, known as
"Galaxy," is visibly distinguished from the other search
engines to indicate that it has not yet responded to the query. The common controls of section 72 include more-button 79 to request the display of newly arrived search results, and modify-search-button 80 to request a new or modified query be sent. Lastly, the display of section 73 includes the information items returned from the search engines. Each information item is displayed separately and includes title 81, descriptive text 82 if available, and line 83 with the URL of the web page for this item and the estimated relevance of this item to the query, here "1000." The items are sorted for display by descending values of the estimated relevance. The displayed items are scrolled using controls provided by the web browser. This user interface is implemented as a Java applet downloaded from a netbot server and executed by the web-browser. In this manner, the interface of FIG. 2B is capable of greater interactivity than that of FIG. 2A. For example, it can poll the netbot server for current search status and update the status displays accordingly without user action.

Although the user interface is described primarily in terms of windows and buttons, one of skill in the art will recognize that this invention is adaptable to other display paradigms that provide for display of information and input of user commands. For example, the user interface module can control the entire screen and present graphical displays without intervention of a windowing system.

The user interface module is preferentially implemented with an object oriented programming language supplemented with a class library providing windowing functions. A preferable implementation uses the Java language together with the java.awt package. See, for example, Flanagan, 1996, Java In A Nutshell, O'Reilly & Associates, sections 5 and 19.

The Integrator Module

FIG. 3 illustrates the preferred functional modules, data bases, and their functional interrelationship of netbot 30 in general and of an integrator module 37 in particular. The integrator preferably consists of three functions: a query router 39, a wrapper database 40, and an aggregation engine 38. These components are introduced here and described in detail in the following. Given user query 31 delivered from user interface module 34, the integrator first calls query router 39 to rank the network information sources known to a netbot in order of relevance and to return the N most relevant sources. Next the integrator retrieves the N wrappers for the N most relevant sources from wrapper database 40. These wrappers, which are descriptions of the information source and its requirements, are written in the wrapper description language of this invention. The retrieved wrappers are used by aggregation engine 38, first, to format the query into forms recognized by each information source, and second, to understand the information returned by each information source 33 in order to eliminate extraneous formatting matter and to put the received information into a common format. This formatted information is then aggregated and passed to the user interface module for presentation 32 to the user according to preferred incremental display method 36. The user display is also controlled by stored user preferences 35.

The Query Router

A well-behaved netbot should preferably use scarce network bandwidth and information source processing resources in a competent and efficient and frugal manner, while at the same time best answering each user query. Such behavior minimizes resource usage, and thus achieves best overall performance, both for the individual netbot and for all netbots functioning simultaneously on a network.

The query router is important to achieving this behavior because it permits the netbot to send requests only to information sources likely to have information relevant to a query. From a user query, the query router determines the relevance of each information source to the given query and returns the N most relevant sources. N is a parameter controllable at user preference and can be as small as 1. This relevance determination is preferably over inclusive rather than under inclusive. Occasionally including an irrelevant information source is preferable to missing a relevant and important source. Further, it is preferable that this relevance determination be quick to compute, not requiring costly processing techniques.

In a preferred embodiment, the query router calculates a numerical relevance rank value for each information source that estimates the source's relevance. This calculation is based on the concept of "conceptual classes." Thus each information source is tagged in advance with the conceptual classes for which it is relevant. Then the query router maps each query to the conceptual classes relevant to it, and finds information sources with conceptual classes shared by the query. The mapping of a query to its conceptual classes is preferably done with a hash function.

Aggregation Engine

The aggregation engine is the coordinating function of the integrator module. It receives the user query from the user interface module and requests the query router to provide a list of the N information sources most relevant to the given query. Then it retrieves the N wrappers for the N information sources from the wrapper database. Guided by the N wrappers, the aggregation engine translates the query into the request formats accepted by each of the N information sources and transfers the N requests to the I/O manager for network transmission. For some sources, the query may be in the format of a form to be returned. When a response is received from an information source, the aggregation engine, again guided by the appropriate wrapper, extracts data from the response and places it into a list of data fields, called a tuple format, relevant to the particular information domain. Optionally, each tuple can be assigned a priority order using a method appropriate to the particular information domain. Finally when the incremental display manager requests data to present to the user, perhaps in response to a more-button request, the aggregation engine passes the tuples to the user interface module, sorted in priority order if a priority is determined.

For example, if the information domain relates to Internet online software vendors, then the tuples optionally contain such relevant fields as product name, manufacturer, software version number, operating system required, price, etc. An exemplary priority order of the tuples can be by price, by delivery delay, or other factor at user preference.

For a further example, if the information domain relates to World Wide Web ("WWW") search engines, which index information pages available generally on the WWW, then the tuples optionally contain such relevant fields as title of the indexed page, the universal resource locator ("URL") of that page, relevance scores estimating the relation of the indexed page to the query, descriptive text, etc. An exemplary priority order can be based on the netbot's normalized estimate of the relevance of the indexed page to the query. If the netbot does not retrieve the indexed page, then it sums the normalized relevance estimates for this response that are returned from each of the search engines. If a search engine does not return a relevance estimate, a default value is used. The obtained relevance estimates are then normalized by linearly adjusting the returned scores to have a common
maximum of, e.g., 1000, and then multiplying the adjusted scores by a confidence factor. This confidence factor, ranging from 0 to 1, is a predetermined estimate of the reliability of a particular information source's own relevance estimates. For example, it can be determined by practical experience with the information source's relevance estimates.

Alternately, the user can request the netbot to retrieve the page in order to do its own relevance estimate. In an exemplary embodiment, for queries requesting the presence either of all query words or of any query words, the estimate is determined by scanning the page and counting the number of query words actually present, and then scaling the count so that the presence of all words results in the common maximum relevance value. For queries requesting the presence of a phrase, the estimate is determined, for example, by subtracting from the common maximum a normalized sum of the square of the distance in the page of each word of the phrase from its successor word in the phrase. Thereby, if the phrase appears contiguously in the page the relevance is high, whereas if the words of the phrase are widely separated on the page, the relevance is low.

In summary, for the majority of information domains, the priority order is determined from a relevance computation, as in the WWW search engine example. However, for certain domains such as online software vendors, a priority order can be simply determined from the values of one or more numeric fields of the response tuples.

The Wrapper Database

The preferred manner of describing information sources and their capabilities, in particular their query formats and

The wrapper description language (hereinafter referred to as the "WDL") of this invention facilitates the semantic description of queries, forms, and pages by using a declarative description format that combines features from grammars and regular expressions. Here an example of this description language is presented. A detailed description is set forth in a later section. Syntax used follows conventions known to those of skill in the art for specifying grammars, including regular expressions. See, e.g., Schwartz, 1993, Learning Perl, O'Reilly & Associates, Inc., chapter 7; Aho et al., 1986, Compilers Principles, Techniques, and Tools, Addison Wesley Publishing Co., section 3.3.

An exemplary description in WDL of a typical page returned from a search engine follows here. The WDL interpreter uses the page description to parse a page and to execute any specified action statements. Note that "stuff" is a token reserved in the WDL that matches any character string up to the first occurrence of a mandatory following string literal.

<page>::= stuff "<dl>" <item>* .*$
<item>::= stuff "<du>" stuff "=$'■ (stuff) "$"><strong>'
^'{^^wl^sa^ ^
< i> ouipu

This describes a page made up of, among other data, a sequence of zero or more items. In detail, a page is specified to consist of stuff, then the string "<dl>", then zero or more items, then zero or more characters (".*"), then finally the end

response formats, is with compact, modular, declarative 30 of , lhe P a S e P )■ "j \&™% jj an item includes three fields of 

... , 0 . ... M re evance, denoted by (stutl) and reterred to sequentially by 

descnptioas called wrappers. Smce a netbot can access from ^ l • • . \ \ 

, . , , t ri , c . c $0, $1, and $2, that when an item is recognized are output 

several hundred to many thousands of information sources, (he ^ y , s(atement !n delai| an { * m js i|ie( j t0 

the descriptions of the sources are preferably compact, consist of stuft; then {b& s(ring ^ men mofe ^ then , he 

requiring a minimum of storage. Nrther, since new infor- 35 s|ring then tne first field of imercst , ater referrec j to by 

mation sources are frequently created and existing sources $0> then , hc s|ring «« ><slrong? ,' s t h en the second field of 

frequently change their format, easy maintenance of source interest later referred to by Si, then the string "</strong></ 

descriptions is important. A modular, declarative a >", then more stuff, then the string "<dd>", then the third 

description, instead of a complex procedural description, field of interest later referred to by $2, then the string 

facilitates such maintenance. In one embodiment of this 4Q "<br>" The action statement within braces, in this case 

invention, wrappers can be learned by a separate module for specifying output of the variables $0, $1, and $2 with the 

information sources having sufficiently regular formats. bindings when an <item> is matched, is executed when each 

For each information source, in an exemplary item is recognized, 

embodiment, each wrapper advantageously includes the The Netbot System 

following information: The preferred functional structure of a netbot can be 

1. The Universal Resource Locator ("URL") address of 45 assigned to system hardware components in various alter- 
the information source; natives * n * preferred alternative in any case depends on 

_ ™ 4 , i e . u which allocation of function achieves a rapid response and 

2. The conceptual classes of the source; reasonable cost. FIG. 4 generally illustrates exemplary net- 

3. A description of the mapping from query arguments, bol hardware embodiments and options in view of the 
e.g. words or phrases, to fields of the query or HTML 50 previous general description. It illustrates the interrelation- 
defined form used to interrogate the source (including ship of uscr cornpu ter elements 51-56, network 57, infor- 
site support for any, all, phrase, or proximity queries); mation sources 58, and netbot server computers 59-61. 

4. A description of the format of the query response or Computer 51 is a user computer including a processor, 
HTML page layout that enables parsing of relevant memory, and various attached peripherals. Such peripherals 
information from other information and extraneous 55 include display device 52, or other device for user 
formatting matter. interaction, network attachment 54, optional hard disk stor- 

At least items 3 and 4 are advantageously written in the age 53, and so forth. Computer 51 can be alternatively a 

wrapper description language of this invention. network device without permanent storage, a PC, a 

A netbot can retrieve wrappers in various manners. In one workstation, or more powerful computer. It is preferred that 

embodiment, the information source itself can supply its 60 computer 51 be a PC or a workstation running one of the 

own wrapper upon request from a netbot. In an alternative Windows operating systems, the Macintosh operating 

embodiment, the netbot can provide its own wrappers in system, or UNIX. Present in the memory of user computer 

various manners. For example, wrappers can be built in the 51 is, among other software, local netbot software 55 and 

netbot itself, especially where the netbot accesses only a few local system components 56. The local netbot software 

information sources. Also, wrappers can be stored in a local 65 implements one of more of the netbot functions. The local 

database or can be downloaded on demand from a central- system components can include, for example, a web 

ized database. browser. 
Network 57 can be any network with a plurality of
attached information sources 58, which can be optionally
conceptually classified by subject matter into information
domains. In a preferred embodiment, network 57 is the
public Internet or a private intranet supporting the TCP/IP
suite of protocols, including such user level protocols as
FTP, HTTP, and so forth. The information sources are server
computers which make their stored information available
using the protocols supported by network 57. Such information
can include databases of periodicals, newspapers,
etc., information on or produced by particular commercial,
educational, or other types of organization, facilities for
electronic commerce, etc.

In such a network, a netbot can have various embodiments.
In an entirely local embodiment, all netbot functions
reside in local netbot software 55 on user computer 51,
which in this embodiment must have sufficient processing
and storage capabilities. In alternative embodiments, one or
more of the disclosed netbot functions can be distributed on
other network attached computers.

For example, computer 59 is a wrapper server for accepting
requests for downloading wrappers from its wrapper
database. The wrapper database can be stored in memory or
on disk using any data management system capable of
storing and retrieving compact textual descriptions. Computer
60 is a query server for performing query routing by
accepting queries and returning the N most relevant information
sources from the many tens or hundreds of thousands
about which it stores information. Computer 61 is a netbot
server for performing the integrator module function by
accepting user queries and returning search results, perhaps
using the facilities of wrapper server 59 and query server 60.
With these network servers, local netbot software preferably
only supports the user interface, which may be delegated
entirely to a web browser. Alternately, it can further include
the aggregation engine, which makes query routing requests
to query server 60 and wrapper requests to wrapper server
59. Further, it can include one or both of these latter
functions.

The various computers of a netbot system can be provided
with software for performing the methods of this invention
either from computer readable media or by loading across a
network. This invention is adaptable to known magnetic and
optical media, such as disks, tapes and CD-ROM.

5.2. THE I/O MANAGER 


The I/O manager module performs hardware, operating
system, and network specific interfacing for the integrator
module. Network interfacing includes the tasks of sending
requests to and receiving responses from network attached
information sources. An important application of the netbots
of this invention is to information retrieval over the Internet.
In this application the I/O manager is responsible for implementing
the relevant protocols of the WWW, Gopher, FTP,
Internet tools, etc. Optionally, it can temporarily cache pages
and other data in order to improve response time.

Operating system interfacing includes the task of window 
management for the user interface module and access to the 
wrapper database, if present. 

Preferably, the I/O manager is constructed from commercially
available protocol stacks, windowing libraries, such as
the java.awt package, and other tools. In some
implementations, more or less of the I/O manager functions
can be performed by other system components on the
network attached computer. Optionally, the I/O manager is
designed to be scalable to multiple machines, to not require
multi-threaded or reentrant code, and to be cross platform
and persistent.
5.3. THE AGGREGATION ENGINE

In the preferred embodiment the functions previously
identified for the aggregation engine component of this
invention are performed by the following process. Searches
of the information sources proceed in parallel because all
requests are transmitted without waiting for any responses.

1. Receive a user query from the user interface module; 

2. Perform query routing to determine the N information 
sources most relevant to the user query; 

3. For each of the N relevant information sources, do: 

A. Retrieve the wrapper for this information source (for 
example, from the source itself or from a local or 
remote wrapper database); 

B. Guided by the wrapper, format the user query into 
the form or format required by the information 
source; 

C. Transfer the translated command to the I/O module 
for transmission to the information source; 

4. Initialize the list of responses to be empty; 

5. Until a user specified time limit is reached, do: 

A. When an information source response has been 
received by the I/O module and transferred to the 
integrator, then: 

i. Guided by the wrapper for this information source, 
parse the response to understand the information 
returned, discard the site-specific formatting text 
and other irrelevant matter, and gather relevant 
fields into tuples; 

ii. Add each tuple to the list of tuples, optionally 
performing priority ranking, duplicate 
elimination, etc.; 

B. Wait for the next response; 

6. Deliver the list of created tuples to the user interface
module on request, which can be due, for example, to
user activation of the show-me-button or more-button
controls.
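
The process above can be sketched in Python. This is a minimal illustration, not the implementation of this invention: a thread pool stands in for the I/O module, and each information source is represented by a hypothetical (fetch, parse) pair of functions standing in for its wrapper. All names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def aggregate(query, sources, time_limit_s):
    """Send the query to every source at once, then collect tuples until the deadline.

    `sources` is a list of hypothetical (fetch, parse) pairs: fetch(query)
    returns raw response text and parse(text) returns a list of tuples.
    """
    results = []                                    # step 4: empty response list
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    try:
        # step 3: transmit all requests without waiting for any response
        futures = {pool.submit(fetch, query): parse for fetch, parse in sources}
        try:
            # step 5: handle responses in arrival order until the time limit
            for done in as_completed(futures, timeout=time_limit_s):
                try:
                    text = done.result()
                except Exception:
                    continue                        # a failed source is skipped
                for tup in futures[done](text):     # step 5.A.i: parse into tuples
                    if tup not in results:          # step 5.A.ii: drop duplicates
                        results.append(tup)
        except FuturesTimeout:
            pass                                    # deadline reached: keep what arrived
    finally:
        pool.shutdown(wait=False)                   # do not block on slow sources
    return results                                  # step 6: delivered on request
```

Responses that arrive after the time limit are simply ignored here; priority ranking is left out and is treated separately below.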

When multiple information sources are queried, it is
preferable in step 6 to present to the user interface module,
and thereby to the user, a single merged list of tuples
extracted from the responses and sorted according to an
estimate of the significance or relevance of each tuple to the
user. Such an estimate is preferably made according to
methods specific to the information domain to which the
netbot is directed. For certain domains, a significance estimate
can be made directly from the value of one or more
data fields in the tuples. For example, in a domain of
electronic shopping, significance can be related only to price
or delivery date, according to user preference. For most
domains, however, a significance estimate is made according
to relevance of the returned information, which must be
determined by examining the responses from each source.

In a preferred embodiment of such a relevance 
determination, the user has the choice of whether or not to 
have the netbot examine all information pages itself. If the 
user so chooses, the relevance is determined by the netbot 
according to a domain specific Analyze function. In a 
domain of information queries, an exemplary Analyze func- 
tion finds the number and location of query words in the 
returned response. For keyword queries, responses with 
more keywords present with greater frequency are more 
relevant. For phrase based queries, responses having the 
words of the phrase more closely spaced, for example in one 
sentence or even in contiguous sequence, are more relevant. 
In other domains, appropriate Analyze functions are pro- 
vided. 
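
As an illustration only, the two Analyze behaviors described above might look like the following Python sketch. The common maximum of 1000 follows the example value used in this description. The tokenizer, the function names, and the exact normalization of the phrase penalty (squared excess gap divided by page length, averaged over successive word pairs) are assumptions chosen for the example, not details disclosed here.

```python
import re

MAX_RELEVANCE = 1000.0  # the "common maximum" example value


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def analyze_keywords(page, query_words):
    """Count the query words present, scaled so that all present gives the maximum."""
    present = set(tokenize(page))
    wanted = [w.lower() for w in query_words]
    if not wanted:
        return 0.0
    hits = sum(1 for w in wanted if w in present)
    return MAX_RELEVANCE * hits / len(wanted)


def analyze_phrase(page, phrase_words):
    """Maximum minus a normalized sum of squared gaps between successive phrase words."""
    tokens = tokenize(page)
    wanted = [w.lower() for w in phrase_words]
    positions = {}
    for index, token in enumerate(tokens):
        positions.setdefault(token, []).append(index)
    if not wanted or any(w not in positions for w in wanted):
        return 0.0
    pairs = list(zip(wanted, wanted[1:]))
    penalty = 0.0
    for first, second in pairs:
        # distance from a word to the nearest later occurrence of its successor
        gaps = [b - a for a in positions[first] for b in positions[second] if b > a]
        distance = min(gaps) if gaps else len(tokens)
        penalty += ((distance - 1) / len(tokens)) ** 2   # contiguous words cost nothing
    if pairs:
        penalty /= len(pairs)
    return MAX_RELEVANCE * (1.0 - penalty)
```

With this normalization a contiguous phrase scores the full maximum, and the score falls as the words drift apart on the page.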

If the user chooses not to have the netbot examine the
responses, the netbot relies on relevance estimates returned
from the information sources. If a particular source does not
return a relevance estimate, a default value is used. These
estimates are then normalized to be between, e.g., 0 and
1000, and multiplied by a confidence ranking factor. This
confidence factor, ranging from 0 to 1, is a predetermined
estimate of the reliability of a particular information
source's own relevance estimates. For example, it can be
determined by practical experience with the source's estimates.
Where the same tuple is returned from two or more
sources, the relevance values from all those sources are
combined. Optionally, the relevance estimates returned from
each source are adjusted to have a uniform distribution in the
normalization range.

In a particular detailed embodiment, this determination is
performed according to the following process. Here, query
routing has determined a list of K information sources,
source_k with k from 1 to K, and returned their confidence
ranks, crank_k. Each of these sources has been queried and
has returned responses, and K lists of information tuples,
tuples_k[j] where j is from 1 to length_k, have been extracted
from these responses. The user's preference for netbot
analysis is recorded in verification flag V. The variable
t.score represents the composite relevance score for tuple t;
the variable t.sourcescore_k represents the relevance estimate
returned from information source_k for the response
that tuple t was extracted from.
Input:  list of K sources with their confidence rank pairs
        (source_k, crank_k), obtained from the query
        routing system; K ordered lists of tuples,
        tuples_k, of length length_k, obtained from source
        source_k; and the verification flag V (Boolean),
        obtained from user preferences
Output: Merged list of all tuples sorted by relevance.

/* Main routine */
IF (V is true) THEN
    FOREACH k = 1 . . . K
        FOREACH tuple t in tuples_k
            page = the HTML page that tuple t points to,
                   downloaded if necessary
            t.score = Analyze(page);
ELSE /* V is false */
    FOREACH k = 1 . . . K
        NormalizeScores(tuples_k)
        AdjustByHeight(tuples_k)
        AdjustByServiceRanking(tuples_k)
    /* Merge result tuples, t, into MERGED_LIST and
       determine a composite relevance score, t.score, from the
       scores returned by the information sources,
       t.sourcescore_k; the same tuple returned from multiple
       sources has its composite score incremented by the
       source score from each source */
    FOREACH k = 1 . . . K
        FOREACH tuple t in tuples_k
            IF t is not in the MERGED_LIST THEN
                Add t to the MERGED_LIST
                t.score = t.sourcescore_k
            ELSE
                t.score = t.score + t.sourcescore_k
            ENDIF
ENDIF
SORT all tuples t by t.score and discard duplicates
OUTPUT sorted tuples
EXIT /* finished relevance ranking */

/* Subroutines */

SUBROUTINE NormalizeScores(tuples_k)
/* If information source_k returns relevance estimates,
   normalize them to fall in the range from 0 to 1000;
   otherwise, use a default relevance estimate */
    /* "s" is the k'th information source's relevance score
       for the first tuple on the list of tuples from source_k */
    s = tuples_k[1].sourcescore_k
    IF (s == 0) THEN
        /* this information source returns no scores,
           therefore use default */
        FOREACH tuple t in tuples_k
            t.sourcescore_k = 1000
    ELSE
        scaling_factor = 1000.0 / s;
        FOREACH tuple t in tuples_k
            t.sourcescore_k = t.sourcescore_k * scaling_factor;
    ENDIF
ENDSUB

SUBROUTINE AdjustByHeight(tuples_k)
/* Adjust the source scores to have a uniform percentile
   distribution; for example for 10 tuples, the first tuple's
   source score is adjusted to 100% of its source score, the
   second tuple's source score is adjusted to 90% of its source
   score, etc. */
    percent_step = 100 / length_k;
    percent_off = 0;
    FOREACH tuple t in tuples_k
        t.sourcescore_k = t.sourcescore_k * (100 - percent_off) / 100;
        percent_off = percent_off + percent_step;
ENDSUB

SUBROUTINE AdjustByServiceRanking(tuples_k)
/* Each service is assigned a percentile ranking indicating
   the confidence given to its returned source scores; this
   ranking is used to scale the returned source scores for each
   tuple; for example, a 90% confidence ranking means that each
   source score is scaled by 0.9 */
    FOREACH tuple t in tuples_k
        t.sourcescore_k = t.sourcescore_k * crank_k
ENDSUB


One of skill in the art will recognize that these processes
are amenable to routine alterations and enhancements that
perform the same functions in the same manners. In particular,
other values for the normalization range and default
value for the information source relevance estimates can be
used. This invention includes such routine alterations. These
processes are preferably implemented in C++, but can
alternatively be implemented in any procedural or object-oriented
programming language.
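
For readers who prefer running code, the branch of the ranking process that relies on source-supplied scores (verification flag V false) can be sketched in Python as follows. The data layout, a list of (crank, results) pairs with each result a (tuple key, source score) pair, and the function names are assumptions made for this illustration.

```python
def normalize_scores(scores, top=1000.0):
    """Scale a source's scores so that its first (best) score equals `top`."""
    if not scores or scores[0] == 0:
        return [top] * len(scores)          # source returned no scores: use default
    factor = top / scores[0]
    return [s * factor for s in scores]


def adjust_by_height(scores):
    """Discount by list position: 100% for the first tuple, one step less per rank."""
    n = len(scores)
    return [s * (n - i) / n for i, s in enumerate(scores)]


def adjust_by_service_ranking(scores, crank):
    """Scale every score by the source's confidence rank (0 to 1)."""
    return [s * crank for s in scores]


def merge(sources):
    """sources: list of (crank, [(tuple_key, source_score), ...]), best result first."""
    merged = {}
    for crank, results in sources:
        scores = [score for _, score in results]
        scores = adjust_by_service_ranking(
            adjust_by_height(normalize_scores(scores)), crank)
        for (key, _), score in zip(results, scores):
            # the same tuple from several sources accumulates their scores
            merged[key] = merged.get(key, 0.0) + score
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

For example, a source with confidence 1.0 returning ("a", 50) and ("b", 25), merged with a source with confidence 0.5 returning "b" and "c" without scores, ranks a first (1000), then b (750), then c (250).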



5.4. THE QUERY ROUTER 

The query router receives as input a user query expressed
as a list of words or keywords and returns as output a list of
N information sources ordered by their likely relevance to
the input query. Determination of these information sources
is optimized for speed and over-inclusiveness. Occasionally
including an irrelevant information source is preferable to
omitting a relevant source.

The preferred query router is based on the principle of
assigning relevant concepts to information sources and
query words. In advance, a set of concepts is chosen to
describe the information sources of the one or more information
domains to which one or more netbots are directed.
For each information source in the domains, the relevance of
that source to each of the chosen concepts is judged. Further,
each word that can appear in possible queries is examined to
determine which of the chosen concepts are relevant to the
word. Then, upon receiving the words or keywords of a
query, the concepts associated with these words are
determined, and then the information sources relevant to
these concepts are found. The ranked relevance of each
source is determined by combining the individual relevances
of the source to all the concepts of the query. The case of
phrase based queries is preferably handled by generating
separate data for this query type.
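
The concept-based routing principle can be illustrated with a short Python sketch. It is a simplified reading under stated assumptions: the table size W, the use of CRC-32 as the hash function, the representation of each concept bit vector as an integer, and all function names are choices made for this example. Only word queries are handled; phrase queries would use a separate table in the same way.

```python
import zlib

W = 64  # number of hash slots; assumed somewhat larger than the vocabulary


def slot(text):
    """Hash a word or phrase to a table index in the range 0 to W-1."""
    return zlib.crc32(text.lower().encode("utf-8")) % W


def build_word2concept(concept_keys):
    """concept_keys[i] lists the words and phrases associated with concept i."""
    table = [0] * W                           # one bit vector (an int) per slot
    for i, keys in enumerate(concept_keys):
        for key in keys:
            table[slot(key)] |= 1 << i        # set the bit for concept i
    return table


def route(query_words, word2concept, concept2source, default_relevance, n):
    """Return the indices of the n sources with the highest combined relevance."""
    relevance = list(default_relevance)       # start from the default weights
    for word in query_words:
        bits = word2concept[slot(word)]
        for i, row in enumerate(concept2source):
            if (bits >> i) & 1:               # concept i is relevant to this word
                for j, weight in enumerate(row):
                    relevance[j] += weight    # add the source's weight for concept i
    ranked = sorted(range(len(relevance)), key=lambda j: relevance[j], reverse=True)
    return ranked[:n]
```

Because distinct words can share a hash slot, a word may pick up concepts that belong to another word. That errs on the side of over-inclusiveness, which the router tolerates by design.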
The preferred implementation of this process utilizes four tables
containing relevance information. In the following, W is chosen to be
somewhat bigger, e.g. 10%, than the number of words that can appear in
possible queries; C is the number of chosen concepts; and S is the
number of information sources in the information domain.
WORD2CONCEPT[ ] is a table of W vectors of C bits, where the C bits of
the vector for a word indicate which of the C concepts are relevant to
that word. CONCEPT2SOURCE[ ][ ] is a C by S table. For each of the C
concepts and S sources, the corresponding entry of this table contains
the relevance value of that source to that concept. For example, if
entry <i,j> equals 5, then the j-th information source has a relevance
weight of 5 with respect to the i-th concept. CONCEPT2SOURCE[ ][ ] is
used when searching by words. For searching by phrases, the table
CONCEPT2PHRASE[ ][ ] similarly relates concepts to sources. Finally,
DEFAULT-RELEVANCE[ ] has a default relevance weight for each of the S
information sources.

The preferred implementation performs the following process.

1. For each of the S information sources, set
RELEVANCE[j] = DEFAULT-RELEVANCE[j];

2. For each word in the user query, do:

A. Compute a hash function on the word obtaining a number, M, between
zero and W. Any suitable hash function is adaptable to this process. An
exemplary hash function is found in Sedgewick, 1990, Algorithms in C,
Addison-Wesley Publishing Co., chapter 16;

B. Let the C bit vector V equal WORD2CONCEPT[M];

C. Combine the relevances for all the concepts in V to the relevance
for the information sources by performing:

    For i from 0 to C, do:
        If (i-th bit of V is 1) then
            FOR j = 1 to S DO
                RELEVANCE[j] = RELEVANCE[j] + CONCEPT2SOURCE[i][j]

Monotonically increasing functions other than + can also be used to
combine the individual concept relevances into a final relevance;

D. Combine the relevances for all the words in the user query, for
example by adding them together.

3. In the case of searching by phrases, additionally do:

A. Concatenate all the words of the user query phrase;

B. Compute the hash function on the phrase, obtaining M, and let the C
bit vector V equal WORD2CONCEPT[M];

C. Combine the relevances for all the concepts in V to the relevance
for the information sources by performing:

    For i from 0 to C, do:
        If (i-th bit of V is 1) then
            FOR j = 1 to S DO
                RELEVANCE[j] = CONCEPT2PHRASE[i][j]

4. Sort information sources based on their RELEVANCE, and return the N
most relevant sources.

One of skill in the art will recognize that this process is amenable to
routine alterations and enhancements that perform the same functions in
the same manner. This invention includes such routine alterations.

This process is preferably implemented in C++, but can alternatively be
implemented in any procedural or object-oriented programming language.
In the case of a query router which maintains information on a large
number, e.g., tens of thousands, of information sources, the query
router is preferably implemented as a server process on a server
computer to accommodate the size of the required data structures and
the processing requirements of query routing.

Construction of the table WORD2CONCEPT[ ] begins with the selection of
concepts to characterize the information domain of interest and the
determination of words and phrases likely to occur in user queries. For
each concept, the following actions are performed. The words and
phrases associated with that concept or to which the concept relates
are assigned to the string arrays KEYS[ ] and PHRASES[ ]. Then the
following process is carried out.

    FOR i equals 1 to the number of elements in KEYS[ ], DO
        Apply the previously used hash function to KEYS[i]
            to obtain a number M between 0 and W
        SET the bit matching the current concept in
            WORD2CONCEPT[M]
    FOR i equals 1 to the number of elements in PHRASES[ ], DO
        Apply the previously used hash function to PHRASES[i]
            to obtain a number M between 0 and W
        SET the bit matching the current concept in
            WORD2CONCEPT[M].

These actions are repeated for each chosen concept. Alternatively,
concept information can be stored along with the string information,
using either open or closed hashing, in order to preserve accurate
string to concept matching.

5.5. THE WRAPPER DEFINITION LANGUAGE

This section first presents introductory material on the use of the WDL
in wrappers. Next, the two principal components of the WDL, the action
language and the regular expression language, are described in detail.
Finally, an exemplary embodiment of the WDL is presented.

A wrapper is a description of an information source and how a netbot
should interact with it, in particular how to format requests to it and
how to understand responses from it. Wrappers enable netbots to
interact efficiently with hundreds to many thousands of information
sites on a network. This interaction presents two requirements: first,
compact storage of a description of an information source using a
representation encoded in the WDL; and second, use of this description
to understand the information source. Currently, for example, netbots
need to format requests to a source and to parse useful information
from the pages returned by the source while ignoring irrelevant
formatting information. As information sources become more functional,
netbots will need to process interactions more complex than simple
request-response pairs. Therefore, wrappers and the WDL preferably have
the following features:

1) WDL descriptions are easy to write, easier than, for example, in
C++. This is important because new information sources are created
frequently and existing services change their formats. Optionally,
wrapper descriptions can be automatically generated using machine
learning techniques in information domains where responses have a
regular, predetermined format.

2) Wrapper descriptions are small. This is important so they can, for
example, be stored locally in a database or quickly transmitted from a
server to a netbot running in a client, even over a slow network
connection. Optionally, information sources can supply their own
wrappers on request.

3) WDL descriptions can be automatically compiled into fast finite
state automata that quickly parse the information returned by
information sources.

4) Using wrappers with WDL permits netbots to adapt to new types of
information formats and new types of information server interactions in
the future.

The wrapper description preferably specifies at least two 
processes: first, requesting information from an information 
source, e.g., how to fetch the appropriate HTML formatted 
page from a source; and second, how to parse returned pages 
to extract the relevant data. To perform the first process, the 
WDL includes an action language component, which is an 
extensible language of expressions and statements. To per- 
form the second process, the WDL includes a regular 
expression language component, which is an extensive and 
novel means of specifying regular expression pattern match- 
ing facilities. 

In alternative implementations, netbots can utilize alter- 
native pattern matching facilities known to those of skill in 
the art. For example, the regular expression component 
could be replaced with a context free language specification 
("CFL"). In this case, implementation of the WDL can 
follow techniques known for construction of compiler 
compilers, for example YACC. However, where possible, 
regular expression pattern matching is preferred because of 
its straightforward specification and rapid execution. 
An Example Of The Regular Expression component 

Regular expressions are advantageous for describing the 
format of the information returned from many information 
sources. The regular expression component of the WDL 
augments prior regular expression matching facilities in 
several novel manners. First, it permits programming lan- 
guage facilities of the action language, e.g. statements and 
expressions, to be executed when regular expressions are 
recognized and with variable bindings as determined by 
partial matches recognized during the overall recognition. 
Second, the preferred implementation compiles regular 
expressions into compact and efficient finite state automata. 
Third, it encourages the efficient and intuitive expression of
complex regular expressions in a nested manner. And fourth,
it possesses an efficient backtrack-free padding facility.

An exemplary wrapper follows which is suitable for an 
Internet information search engine and is written in the 
WDL. The significance of each portion is explained in 
comments, which are surrounded by "/*" and "*/." 

/* A list of input query words is passed to the wrapper
   from the aggregation engine, which is executing the
   wrapper for the information source. The argv( ) function
   of the action language extracts a list of the input
   query words. */

{ $keywords = argv(2);

/* The request to be sent to the information source is
   calculated by concatenating, indicated by the "." operator
   of the action language, three separate strings: one,
   the Internet URL address of the information source and the
   initial query format string; two, the query words to
   search; and three, the remainder of the query format
   string for this information source. */

  $url = "http://searcher.source.com/searcher.cgi?query="
         . $keywords . "«SonlyrM3";

/* The fetch action statement transfers this query to the I/O
   module for network transmission, and then waits for
   the HTML formatted response. */

  $page = fetch(0, $url, "");

/* The HTML formatted response text is parsed using the
   following regular expression grammar. */

  $result = parse($page, <page>); }

/* A response from this exemplary source consists of a
   page containing zero or more information items. This is
   hierarchically described in the regular expression language
   by expressions for <page> and <item>, where
   <page> refers to <item>s. In particular, a page includes
   a chunk of text followed by the string "results
   returned . . .", then followed by zero or more items. */

<page> ::= stuff 'results returned, ranked by'
           <item>* END

/* Each item consists of irrelevant text, HTML formatting
   codes, and relevant data fields. The relevant data fields
   are enclosed in parentheses and referred to sequentially
   by $0, $1, and $2. These fields, along with the number
   500, are output as a tuple when the "output" statement
   of the action language is executed upon recognition of
   an <item>. The particular meaning of the definition of
   <item> will become apparent in the next section. */

<item> ::= stuff "<hr>" stuff '<center><b><a href="'
           (stuff) '"' stuff '>' (stuff) '</a>'
           stuff "<font size=-1>" (stuff) '</font>'
           { output($0, $1, $2, 500); } END
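
To make the behavior of the page and item descriptions concrete, the following Python sketch emulates them with the standard re module. It is an approximation under stated assumptions, not the WDL interpreter: "stuff" is modelled as a non-greedy match up to the next literal, and the item layout is assumed to be an <hr>, then an anchor inside <center><b> whose href holds the first field and whose text holds the second, then a <font size=-1> element holding the third. The function name and the sample page in the usage note are hypothetical.

```python
import re

# "stuff" matches any text up to the next mandatory literal; a non-greedy
# ".*?" placed before each literal plays that role in this emulation.
ITEM = re.compile(
    r'<hr>.*?<center><b><a href="(.*?)".*?>(.*?)</a>.*?<font size=-1>(.*?)</font>',
    re.DOTALL,
)


def parse_page(page, marker="results returned, ranked by"):
    """Emulate <page> ::= stuff marker <item>* END, emitting one tuple per item."""
    start = page.find(marker)
    if start < 0:
        return []                        # the page does not match the description
    body = page[start + len(marker):]
    # each match binds $0, $1 and $2; 500 is the constant from the output statement
    return [(m.group(1), m.group(2), m.group(3), 500) for m in ITEM.finditer(body)]
```

Applied to a page containing two such items after the marker text, parse_page returns two four-element tuples. Unlike the backtrack-free automata described for the WDL, Python's regular expression engine backtracks, so this sketch illustrates the matching semantics only, not the performance.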



5.5.1. DESCRIPTION OF THE WDL COMPONENTS



This section describes the preferred features of the action 
language and the regular expression components of the 
WDL. This invention is not limited to the descriptions 
presented. These descriptions make use of known methods 
for describing grammars, including regular expressions, and 
one of skill in the art will recognize that there are many 
substantially equivalent descriptions. Such equivalent 
descriptions include, but are not limited to, those resulting 
from renaming the described syntactic elements or from 
applying known grammatical transformations to the pre- 
sented syntax. This invention comprehends also these sub- 
stantially equivalent descriptions. 
The Action Language Component 

The action language includes certain preferred base-level 
features: an assignment statement; sequencing statements 
such as "if" and "while" constructs; a tuple output statement; 
string and numeric expressions; string, numeric, and boolean 
operators; and certain built-in functions. It will be apparent 
to those of skill in the art that a base-level action language 
of equivalent semantic expressive ability can be constructed 
with slightly different choices of features. For example, the 
while statement can be replaced by a goto statement. The 
action language component of this invention comprehends 
such known equivalent formulations of the semantics disclosed. 
Further, in optional embodiments, the preferred base-level 
features can be augmented with such additional features as: 
additional sequencing statements such as "for" or "repeat" 
loops; user defined functions; additional string, numeric, and 
boolean expressions and operators such as are found in C or 
C++; additional built-in functions; and array variables. Such 
additional features can be added to the preferred base-level 
language in known manners. See, e.g., Aho et al., 1986, 
Compilers Principles, Techniques, and Tools, Addison Wesley 
Publishing Co. 

The syntax of the preferred base-level action language is 
given by the following grammar expressed in a standard 
notation. 

Statement  ::= $VariableName "=" Expression 
            |  "if" Expression Statement [ "else" Statement ] 
            |  "while" Expression Statement 
            |  "{" Statement* "}" 
            |  "output" "(" [ Expression ( "," Expression )* ] ")" 
Expression ::= string-constant 
            |  float-constant 
            |  $variableName 
            |  Expression Op Expression 
            |  "(" Expression ")" 
            |  functionName "(" [ Expression ( "," Expression )* ] ")" 

Thus, the action language component is defined in terms 
of statements and expressions. A statement can be: an 
assignment statement, which assigns a new value to a scalar 
variable; an if statement; a while loop; a compound 
statement, which is a sequence of other statements enclosed 
by "{" and "}"; or an output statement. Except for the output 
statement, the statements function in the same manner as in 
other procedural programming languages, e.g., C, C++, or 
Pascal. For the if and while statements, the conditional 
argument is considered true if it is either a non-zero number 
or a non-null string, i.e., not the empty string "". The 
OUTPUT statement allows a wrapper to return information 
matched in responses from the information source to the 
netbot module executing the wrapper. For example, executing 
the statement "output (arg_1, . . . , arg_n)" causes the 
tuple <arg_1, . . . , arg_n> to be returned from the wrapper 
to the netbot. 

An expression can be: a string constant, which is a symbol 
string surrounded by quotation marks; a floating point 
number; a variable name, which is a name or an integer 
preceded by a dollar sign; an infix operator applied to two 
surrounding sub-expressions; a sub-expression, which is 
enclosed in "(" and ")"; or a call to a built-in function. 

Further, the language provides operators standard in 
procedural programming languages, including the following: 
arithmetic operators ("+", "-", "*", "/"); numeric comparison 
operators ("<", "==", ">", "<=", ">=", "!="); string 
comparison operators ("lt", "eq", "gt", "le", "ge", "ne"); a 
string concatenation operator ("."); and boolean operators 
("&&", "||", "!"). These operators have the following 
semantics: 

1) +, -, *, /: perform the indicated arithmetic operation on 
the surrounding expressions; 

2) .: concatenate the surrounding strings; 

3) ==, <=, >=, <, >, !=: perform the indicated numeric 
comparison of the surrounding numbers and return the 
floating point values 0.0 or 1.0 accordingly; 

4) eq, le, ge, lt, gt, ne: perform the indicated 
character-by-character comparison of the ASCII codes of the 
surrounding strings and return the floating point values 
0.0 or 1.0 accordingly; 

5) !: return 1 if argument is the number 0 or the string 
"", otherwise return 0; 

6) &&: evaluate the first argument, if the first argument is 
zero or "", stop and return that value, else evaluate the 
second argument and return the latter value; 

7) ||: evaluate the first argument, if first argument is neither 
zero nor "", stop and return that value immediately; else 
evaluate the second argument and return the latter 
value. 

These operators have the following precedence in order 
from highest to lowest: 

1. any expression inside parentheses (highest precedence) 

2. *, / 

3. +, - 

4. <=, >=, <, >, !=, ne, eq, le, lt, gt, ge 

5. ! (boolean not) 

6. && (boolean and) 

7. || (boolean or) (lowest precedence) 

All operators are left-associative. 

No variable declarations are needed. First, all variables in 
the action language are distinguished by being preceded by 
a "$". Those special variables denoting sub-strings matched 
by regular expressions consist of a "$" followed by an 
integer, e.g., $0 or $1. Second, at run time, automatic type 
conversion between string and floating point types is done 
dynamically. If an operator expects a number but gets a 
string, the string is converted to a number by calling the C 
library function atof( ), which converts the ASCII 
representation of a number into its internal floating-point 
representation. If an operator expects a string argument but 
gets a number, it uses the C library function sprintf("%f") to 
convert it to a string. Referenced variables that have not 
been assigned are given the default values 0 or "" as 
appropriate. 

The action language component has several built-in 
functions. The following are preferred base-level built-in 
functions: 

1) argc( ), argv( ): When a netbot executes a wrapper, it can 
pass the wrapper one or more arguments. These usually 
represent query parameters, query words, or query keywords 
supplied by the user. The arguments are accessed from 
within the wrapper language by the functions argc( ), which 
returns the number of arguments passed to the wrapper, and 
argv(n), which returns the n-th argument. 

2) fetch( ): This function preferably interfaces with the I/O 
module to transfer a string containing the network address of 
an information source and perhaps query parameters over a 
network to the addressed information source according to 
the protocol, and returns a string containing the 
response of the information source. Wrappers use this 
function to query information sources and retrieve pages. 

3) parse(<string>,<nonterminal>): This function takes a 
string and attempts to match it to the regular expression 
corresponding to the given <nonterminal>, as defined by the 
regular expression language component of the WDL. This 
function returns "1" if the match was successful, or "0" if the 
string did not match the regular expression. Wrappers use 
this function to parse the response of an information source. 

An exemplary wrapper includes the following sequence 
of action language statements. First is a series of statements 
using argc( ) and argv( ) to obtain the user query parameters 
and to initialize a string variable, e.g., $url, with a string 
containing the URL of the appropriate page at the information 
source together with an appropriately formatted query. 
Next, the assignment statement "$html_text=fetch ($url)" 
fetches the query-response page into another string variable, 
e.g., $html_text. Finally, the function parse($html_text, 
<page>) attempts to match the returned html text with the 
regular expression, <page>, which describes the sought 
page. 

The Regular Expression Language 

The regular expression ("reg-exp") component of the 
WDL matches strings against regular expressions. It has 
been found that regular expressions are convenient to 
describe the format of responses returned from a wide range 
of information sources. However, the reg-exp language of 
this invention is capable of matching this information so that 
relevant fields can be extracted in a more rapid and more 
convenient manner than can prior languages and systems, 
such as AWK or Perl. The regular expression component 
includes novel facilities that solve the problems with prior 
matching systems, and thereby, permit its use by netbots. 

A first novel facility allows the specification of regular 
expressions to be broken into pieces in a manner similar to 
a context-free grammar, which specifies a language by a set 
of rules for nonterminals in the grammar. See, e.g., Aho et 
al., 1986, Compilers Principles, Techniques, and Tools, 
Addison Wesley Publishing Co., section 4.2. Writing a 
single regular expression to represent the format of a page 
from an information source, as is required in prior systems, 
often results in a very large and cumbersome expression, one 
which is difficult to write, understand, and maintain. To 
solve this problem of existing systems, the reg-exp component 
specifies a regular expression by a set of rules for 
components of the regular expression. These components 
are labeled by nonterminals. However, in contrast to 
context-free grammars, the set of rules in the reg-exp 
component are not allowed to be recursive or mutually 
recursive. In other words, the rule for a particular nonterminal 
cannot directly or indirectly refer to other rules which 
refer to that particular nonterminal. 

The following exemplifies the use of a set of rules and 
nonterminals. A top-level nonterminal defining an information 
response can be: 

<page> ::= <head> <item>* <tail> END. 

which specifies that the response is a page consisting of a 
head, followed by zero or more items, followed by a tail. The 
keyword "END" denotes the end of a rule. The second-level 
nonterminals on the right-hand side ("RHS") of this rule, 
<head>, <item>, and <tail>, are defined by their own rules: 

<head> ::= "Results of your search:\n" END; 
<item> ::= "Data:" stuff "\n" END; 
<tail> ::= "No more results\n" END. 

To execute these rules, the reg-exp component compiler 
substitutes into the RHS of <page> the RHSs of the rules for 
<head>, <item>, and <tail>. The result is as if the wrapper 
contained the large, cumbersome composite top-level rule: 

<page> ::= "Results of your search:\n" ("Data:" stuff "\n")* 
           "No more results\n" END 

If the second-level rules had contained further nonterminals 
on their RHSs, the compiler would continue making 
appropriate substitutions until there are no more nonterminals on 
the RHS of the composite rule for the top-level rule. Because 
of the lack of recursion or mutual recursion, this substitution 
terminates. 

A second novel facility permits skipping groups of 
characters in a string without backtracking. In prior regular 
expression matching systems, this quite common requirement 
is implemented in a manner that requires the storage of 
many backtracking points on a stack and can lead to 
extensive backtracking. For example, the prior art Perl idiom ".*", 
which matches any number of occurrences of any character, 
or the Perl idiom "[^\d]*", which matches all non-digit 
characters up to the first occurrence of a digit, can cause 
extensive backtracking during a match. This is inefficient 
and not preferable, since information responses should be 
rapidly parsed. See, e.g., Schwartz, 1993, Learning Perl, 
O'Reilly & Associates, Inc., chapter 7. To solve this 
problem, the reg-exp component introduces a simple and 
efficient construct for this common idiom: 

stuff literal-string 

where stuff is a reserved word. Stuff is defined to match all 
characters from the current character, up to but not 
including, the first occurrence of the string "literal-string," 
which must be a string literal. This construct allows a 
compact, efficient, backtrack-free implementation of this 
common and important idiom. 

A third novel facility allows relevant data fields to be 
extracted from arbitrarily complex regular expressions. 
Action language statements can be embedded in the RHS of 
a nonterminal for execution whenever that nonterminal is 
matched and with the variable bindings current at each 
occurrence of a match of that nonterminal. In the case of the 
previous example, the definition for <item> can be extended 
as follows: 

<item> ::= "Data:" (stuff) "\n" { output($0); } END. 

Whenever <item> is matched, $0 is set to the string matched 
by stuff in the parenthesis, and then the output statement is 
executed. In this case, $0 gets bound to whatever characters 
followed "Data:" and preceded the newline character. 

Prior systems, for example AWK and Perl, do not have 
such a facility. Although they do set variables, such as $0, 
$1, etc., they do so only after matching the single composite 
regular expression defining <page>. In other words, the 
entire page is matched before variables are set. Therefore, in 
AWK and Perl, if there is more than one <item> on the 
<page>, the relevant data fields in all but the last <item> are 
lost. The reg-exp component of the WDL solves this problem 
by allowing specification of actions that are executed 
multiple times, once for each nonterminal match, with 
different variable bindings each time. 

Turning now to the description of the reg-exp language, it 
includes certain preferred base-level features: definition of 
nonterminal regular expressions; embedding action language 
statements in regular expression rules; operators for 
expressing alternation and repetition; and literal string 
matching. It will be apparent to those of skill in the art that a 
preferred base-level reg-exp language of equivalent semantic 
expressive abilities can be constructed with a slightly 
different choice of features, and the reg-exp language component 
of the WDL includes such well-known equivalents. 
In particular, the reg-exp language comprehends variable 
renamings and known grammatical transformations applied 
to the rules below. In optional embodiments, the preferred 
base features can be augmented with such additional features 
as: a special disjunction ("||"); repetition for an arbitrary 
integer; matching to character classes; local string 
memory; and anchoring characters. Such additional features 
are standard with the exception of the special disjunction, 
which causes the listed alternatives to be matched 
simultaneously as the string is parsed, the first alternative to match 
being returned. In contrast, the regular disjunction matches 
each alternative in the sequence listed, backtracking until 
success is found. The additional features can be added to the 
base-level language in known manners. See, e.g., Schwartz, 
1993, Learning Perl, O'Reilly & Associates, Inc.; Aho et al., 
1986, Compilers Principles, Techniques, and Tools, Addison 
Wesley Publishing Co.; Hopcroft et al., Introduction to 
Automata Theory, Languages, and Computation, Addison 
Wesley Publishing Co. 
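
As an illustration of the backtrack-free "stuff literal-string" construct described in this section, the following Python sketch shows one way such a match could behave. It is not the implementation of this invention; the function name match_stuff and its return convention are assumptions made only for this example. The point is that a single forward search for the literal suffices, so no backtracking points need to be stored.

```python
from typing import Optional, Tuple


def match_stuff(text: str, pos: int, literal: str) -> Optional[Tuple[str, int]]:
    """Match `stuff literal` starting at index `pos` (hypothetical helper).

    Returns (skipped_text, new_pos), where skipped_text is everything up to
    but not including the first occurrence of `literal`, and new_pos is the
    index just past that literal. Returns None when the literal is absent.
    """
    if not literal:
        raise ValueError("stuff must be followed by a non-empty string literal")
    hit = text.find(literal, pos)  # one forward scan, no backtracking stack
    if hit < 0:
        return None
    return text[pos:hit], hit + len(literal)


if __name__ == "__main__":
    # $0 would be bound to the characters between "Data: " and the newline.
    print(match_stuff("Data: 42\nrest", 6, "\n"))
```

A caller wishing to bind a variable such as $0 would simply keep the first element of the returned pair.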

In a further alternative embodiment, declarations could be 
added to control backtracking in order to improve perfor- 
mance of the regular expression matching. If a certain 
portion of a rule were known to require no backtracking, that 
portion can be bracketed with declarations instructing the 
compiler and interpreter to generate a finite state machine 
without any provision for backtracking. This permits that 
portion of the rule to be matched more efficiently than 
otherwise. 

The syntax of the preferred base-level reg-exp language is 
given by the following grammar expressed in a standard 
notation. 



Rule       ::= <nonterminal> "::=" Regexp "END" 
Regexp     ::= Sequence ( "|" Sequence )* 
Sequence   ::= Repetition+ 
Repetition ::= Term "?" 
            |  Term "*" 
            |  Term "+" 
            |  Term 
Term       ::= String_In_Double_Quotes 
            |  "stuff" 
            |  "(" Regexp ")" 
            |  <nonterminal> 
            |  Action 
Action     ::= "{" Statement* "}" 



Briefly, a rule specifies a particular <nonterminal> to 
recognize a particular regular expression. A regular expression 
can include: disjunctions ("|"); sequences; zero or more 
repetitions ("*"); one or more repetitions ("+"); and zero or 
one repetition ("?"). Terms can be: literal strings enclosed by 
double quotation marks; the special symbol stuff; 
parenthesized regular expressions; nonterminals, which must be 
defined in another rule; or statements written in the action 
language. 
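
Because each nonterminal term is defined by another rule, and the rules may not be recursive, a rule set can be reduced to a single composite expression by inlining. The Python sketch below illustrates that idea under stated assumptions: the tuple encoding of terms ("lit", "stuff", "nt", "star") and the function names are hypothetical and chosen only for this example, and variable renumbering is deliberately left out.

```python
def expand(rules, name, active=()):
    """Return the RHS of `name` with every nonterminal reference inlined.

    `rules` maps a nonterminal name to a list of terms. `active` holds the
    names currently being expanded, so a recursive rule set is reported
    instead of looping forever.
    """
    if name in active:
        cycle = " -> ".join(active + (name,))
        raise ValueError("rules must not be recursive: " + cycle)
    return ("seq", [_inline(rules, t, active + (name,)) for t in rules[name]])


def _inline(rules, term, active):
    kind = term[0]
    if kind == "nt":      # reference to another rule: substitute its RHS
        return expand(rules, term[1], active)
    if kind == "star":    # repetition wraps a single sub-term
        return ("star", _inline(rules, term[1], active))
    return term           # literals and `stuff` are left unchanged


if __name__ == "__main__":
    rules = {
        "page": [("nt", "head"), ("star", ("nt", "item")), ("nt", "tail")],
        "head": [("lit", "Results of your search:\n")],
        "item": [("lit", "Data:"), ("stuff",), ("lit", "\n")],
        "tail": [("lit", "No more results\n")],
    }
    print(expand(rules, "page"))
```

Since every reference is replaced by a structure containing strictly fewer levels of indirection, the expansion of a non-recursive rule set always terminates.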

The Wrapper Description Language 

A complete wrapper description can be: 

WrapperDescription ::= Statement Rule* 
Thus the preferred WDL entities include a statement in the 
action language, which is typically a sequence of action 
language statements, followed by an optional set of rules in 
the reg-exp language, which define a regular expression for 
matching a response returned from information sources. 

To execute a complete wrapper description, the statement 
is executed as is described subsequently. Typically, the 
statement contains calls to the built-in functions fetch( ) and 
parse( ), among others. The parse( ) built-in function 
attempts to match a response returned by fetch( ) by 
invoking a nonterminal defined by the appended rules, each of 
which defines a regular expression. If the regular expression 
match is successful, all the action language statements 
typically embedded in the regular expression are executed, 
typically some action statements embedded in lower level 
nonterminals are executed multiple times with the operand 
bindings current at each occurrence of a match, and the 
parse( ) function returns the value "1". If the match is 
unsuccessful, none of the embedded statements are 
executed, and the parse( ) function returns the value "0". 
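
The all-or-nothing behavior just described can be pictured with the following Python sketch, hand-written for a page made of a head, zero or more "Data:" items, and a tail. It is only an illustration: the function name parse_page, the literal strings, and in particular the choice to defer the embedded actions until the whole match has succeeded are assumptions of this example, not a statement of how the preferred embodiment achieves the result.

```python
def parse_page(text, output):
    """Match head, zero or more items, then tail; return "1" or "0".

    `output` is called once per matched item, with that item's binding of
    $0, but only if the whole page matches.
    """
    head, item, eol, tail = "Results of your search:\n", "Data:", "\n", "No more results\n"
    if not text.startswith(head):
        return "0"
    pos = len(head)
    pending = []                        # deferred actions, one per <item>
    while text.startswith(item, pos):
        start = pos + len(item)
        end = text.find(eol, start)     # `stuff "\n"`: skip to first newline
        if end < 0:
            return "0"
        pending.append(text[start:end]) # this occurrence's value of $0
        pos = end + len(eol)
    if not text.startswith(tail, pos):
        return "0"                      # failure: no action has been run
    for field in pending:               # success: run each action in order
        output(field)
    return "1"


if __name__ == "__main__":
    found = []
    page = "Results of your search:\nData: a\nData: b\nNo more results\n"
    print(parse_page(page, found.append), found)
```

Each item contributes its own binding, so no field is lost when a page contains several items.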

5.5.2. IMPLEMENTATION OF THE WDL COMPONENTS 

The preferred implementation of the WDL is described in 
this section under the following headings: (1) parsing regular 
expression rules, (2) intermediate code generation for 
regular expressions, (3) run-time interpretation of regular 
expressions, (4) action language code generation and 
run-time interpretation. It is understood that this preferred 
implementation of the WDL is exemplary. It is known to those of 
skill in the art that alternatives exist to the implementation 
described herein. For example, the processes described can 
be implemented differently, e.g., with different variables and 
different orderings of the individual steps. Also, the 
processes may implement alternative algorithms to achieve the 
same effects for, e.g., string matching. For a further example, 
the described implementation discloses processes of 
interpreting various intermediate codes. Different intermediate 
codes can be used. For the regular expression language, 
different types of nodes are possible including, e.g., nodes 
having a variable list of successor states to avoid having to 
maintain a backtracking stack of successor states. For the 
action language, instead of the disclosed address-based 
intermediate code, equivalently a reverse Polish stack-based 
intermediate code could be used. Finally, instead of 
interpretation, it is possible to compile directly to a machine 
language using the disclosed syntax-directed methods. 

In the remainder of this section, a parse tree node that has 
label "l", contains data "d", and has child nodes "c_1, . . . , 
c_n" is denoted by "l<d;c_1, . . . ,c_n>". These parse tree 
nodes can be constructed and referenced in ways known in 
the art, e.g., by a pointer to a data area containing node data. 
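
For concreteness, the node notation above could be modeled as follows. This is a minimal Python sketch; the class name Node and its field names are assumptions introduced for illustration, and the string form merely mirrors the "l<d;c_1, . . . ,c_n>" notation.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Node:
    """A parse tree node with a label, optional data, and child nodes."""
    label: str
    data: Any = None
    children: List["Node"] = field(default_factory=list)

    def __str__(self) -> str:
        inner = []
        if self.data is not None:
            inner.append(str(self.data))
        if self.children:
            inner.append(", ".join(str(c) for c in self.children))
        return f"{self.label}<{'; '.join(inner)}>"


if __name__ == "__main__":
    seq = Node("Sequence", children=[Node("String", "Data:"), Node("Stuff")])
    print(Node("Repetition", "*", [seq]))
```

A node with neither data nor children, such as the one for the reserved word stuff, prints as "Stuff<>".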
Parsing Regular Expression Rules 
The first step in the preferred implementation of compiling 
a wrapper description is parsing the rules of the reg-exp 
language and the statements of the action language. The 
input to this step is the set of rules of the reg-exp language 
defining the regular expression to match. The output from 
this step is intermediate code in the form of a set of parse 
trees, one parse tree for the top-level rule and additional 
parse trees for each lower-level rule. 

In a preferred embodiment, the process of this step is 
performed by parsing according to a recursive descent parser 
and emitting the parse tree according to a syntax-directed 
translator. Construction of a recursive descent compiler for 
the previously described syntax of the reg-exp language is 
well known to those of skill in the art. For example, it is 
clearly disclosed with examples in such textbooks as Aho et 
al., 1986, Compilers Principles, Techniques, and Tools, 
Addison Wesley Publishing Co. at sections 2.4 and 4.4. 
Syntax-directed construction of parse trees is covered with 
examples in section 5.2. This invention is not limited to this 
preferred embodiment. Alternative parsing techniques are 
known in the art, and this invention comprehends 
embodiments using such techniques, such as LL or LR parsing. See 
generally Aho et al. at chapter 4. This invention also 
comprehends alternative techniques of intermediate code 
generation known in the art. See generally Aho et al. at 
chapter 5. 

The specification for the syntax-directed analysis includes 
rules for the parse-tree nodes to be created when each syntax 
rule is recognized by the parser. For each of the previous 
reg-exp syntax rules the following nodes, labeled by the 
nonterminal of the rule, are created: 
1. Rule: Create the node "Rule <nonterminal-name; 
node-for-regexp>." This is the node for an entire rule labeled 
with its nonterminal name. 
2. Regexp: Create the node "Alternatives <node_for_sequence_1, 
. . . , node_for_sequence_n>." This is the 
node for a disjunction ("|") of alternative regular 
expressions, which are tested for a match in the order 
listed, the first successful match being returned. 
3. Sequence: Create the node "Sequence <node_for_repetition_1, 
. . . , node_for_repetition_n>." This is the 
node for a sequence of regular expression patterns each of 
which must sequentially match. 
4. Repetition: Create the node "Repetition <type; 
node_for_term>." This is the node for a repetition, where type 
is one of "?" (0 or 1 repetition), "+" (1 or more 
repetitions), or "*" (0 or more repetitions). 
5. String_In_Double_Quotes: Create the node "String 
<literal string>." This is the node to match a literal 
string. 
6. Stuff: Create the node "Stuff<>" for the reserved word 
"stuff." 
7. (Regexp): Create the node "Sequence 
<OpenParenthesis<i>, node_for_Regexp, 
CloseParenthesis<i>>." This is the node which causes 
assignment to a sequentially numbered variable on a 
match by "Regexp." The node "OpenParenthesis<n>" is a 
node representing the n'th open parenthesis that is 
encountered in sequentially parsing the RHS of the current 
rule. The node "CloseParenthesis<n>" is a node 
representing the corresponding n'th close parenthesis 
encountered. 
8. <nonterminal>: Create the node "Nonterminal 
<nonterminal_name>," for an instance of a nonterminal. 
9. Action: Create the node "Action 
<node_for_action_language_statement>." This is a node for an instance of 
an action language statement in which the 
node_for_action_language_statement represents intermediate code 
for the action language statement. 

These rules, which are executed when the parser recognizes 
the corresponding regexp language syntax rule, cause 
the generation of a parse tree, which is formed from the 
listed node types linked in a tree structure. This parse tree is 
input to subsequent steps of this process. 

Intermediate Code Generation for Regular Expressions 

Code generation in the preferred embodiment includes the 
following three steps: 

1. Special preprocessing of stuff nodes; 

2. Eliminating occurrences of nonterminals in the RHS of 
the top-level nonterminal; and 

3. Converting the RHS of the top-level rule into 
a finite state automaton ("FSA"). 

These steps are, first, generally described, and second, 
described in detail. 

First, preprocessing of stuff nodes is necessary to later use 
the preferred postorder traversal method of FSA generation. 
As an FSA cannot be built for a stuff element alone, it is 
necessary to build an FSA for the combination of a stuff 
node and the literal string node that must follow each stuff 
element; only open and close parentheses can intervene 
between a stuff node and its following literal string. To do so, 
the parse tree is traversed and each stuff node is merged with 
the following literal string by replacing both nodes with a 
new Stuff_And_String node. The new node also accounts 
for any intervening parentheses. 

The second step in code generation substitutes all 
nonterminals in the RHS of the top-level nonterminal with 
their respective RHSs to generate a single composite rule for 
the overall regular expression. During this substitution, it is 
necessary to renumber occurrences of variables in action 
language statements. For example, consider the wrapper 

<page> ::= ("dave") ("bill") <a> ("dan") { output($2); } END 
<a>    ::= "oren" (stuff "cody") { output($0); } END 

Here, $2 in the output statement in <page> references 
("dan"), and $0 in the output statement in <a> references 
(stuff "cody"). After substitution, if variables in the output 
statement were not renumbered, the RHS of <page> would 
be expanded as follows: 

<page> ::= ("dave") ("bill") "oren" (stuff "cody") { 
           output($0); } ("dan") { output($2); } 

Here, however, $0 refers to ("dave") instead of (stuff 
"cody"), while $2 refers to (stuff "cody") instead of ("dan"). 
To maintain correct variable assignment, the variable 
references must be renumbered as follows: 

<page> ::= ("dave") ("bill") "oren" (stuff "cody") { 
           output($2); } ("dan") { output($3); } 

The third step in the preferred embodiment of the code 
generation process converts the processed and expanded 
RHS into an FSA by performing a postorder traversal of the 
parse tree representing the RHS of the top-level rule, during 
which each node is in turn converted into an FSA. The FSAs 
have several types of states, most importantly: 

1. Char_branch: Branches to one of 256 possible successor 
states depending on the current character and advances 
the current read position; 
2. Action: Executes a block of action language statements; 
3. Marker: Records the current input string position, plus or 
minus a certain offset, on a mark stack; 
4. Success: When reached, parsing has succeeded; 
5. Fail: When reached, the current attempt at parsing has 
failed, and the FSA must either backtrack to the prior 
backtracking point or fail if no backtracking points are 
available; 
6. Push: Pushes a configuration comprising an FSA state and 
the current input string position onto the backtracking 
stack for later backtracking. 

Code Generation Step 1: Preprocessing Of Stuff Nodes 

The preferred pre-processing of stuff nodes begins by 
expanding or flattening any nested sequences, i.e., 
Sequence<> elements nested within other Sequence<> 
elements. This is done according to the following process: 

FOR each nonterminal PERFORM a POSTORDER traversal of its 
parse tree 
  At each parse tree node v 
    IF (v=Sequence<c_1, . . . , c_n>) THEN 
      WHILE (there is an i, 1<=i<=n, such that 
             c_i=Sequence<d_1, . . . , d_m>) DO 
        Replace node v with a new node 
          Sequence <c_1, . . . , c_i-1, d_1, . . . , 
          d_m, c_i+1, . . . , c_n> 

Next, each occurrence of a stuff node is replaced by a 
Stuff_And_String node that also accounts for the following 
literal string and any intervening parentheses. Every 
occurrence of "stuff" in a wrapper must be followed by a literal 
string with the semantic result that "stuff" matches all 
characters from the current position up to, but not including, 
the first occurrence of the following literal string. Only open 
and close parentheses can intervene between stuff and the 
literal string. In the preferred embodiment, this replacement 
is preferably performed according to the following process: 



FOR each nonterminal PERFORM a POSTORDER traversal of its 
parse tree. 
  At each node v of the parse tree 
    IF (v=Sequence<c_1, . . . , c_n>) THEN 
      WHILE (there is an i, 1<=i<=n, such that 
             c_i=Stuff<>) DO 
        Let j be the smallest j>i such that 
          c_j=String<s> for some string s. 
        IF there is no such j THEN signal an 
          error. 
        ELSE IF (any element c_k (where i<k<j) is 
          NOT either an OpenParenthesis or 
          CloseParenthesis) THEN signal an 
          error 
        /* c_j is node for the string */ 
        ELSE replace node v with a new node = 
          Sequence <c_1, . . . , c_i-1, 
          Stuff_And_String<c_j; c_i+1, . . . , 
          c_j-1>, c_j+1, . . . , c_n> 

This process merges the stuff node, the literal string node, 
c_j, and any intervening parentheses nodes, c_i+1, . . . , c_j-1, 
into the single node Stuff_And_String<the-string; c_i+1, . . . , 
c_j-1>. 
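
The merge just described can be sketched in Python as follows. The sketch is illustrative only: nodes are encoded as plain tuples whose first element is the node label, the function operates on the child list of a single already-flattened Sequence node, and the name merge_stuff is an assumption of this example.

```python
PARENS = {"OpenParenthesis", "CloseParenthesis"}


def merge_stuff(children):
    """Return a new child list with each Stuff node merged with its literal.

    Each ("Stuff",) node, the String node that follows it, and any
    parenthesis nodes in between are replaced by one node
    ("Stuff_And_String", string_node, [intervening parenthesis nodes]).
    """
    out, i = [], 0
    while i < len(children):
        node = children[i]
        if node[0] != "Stuff":
            out.append(node)
            i += 1
            continue
        j = i + 1
        while j < len(children) and children[j][0] in PARENS:
            j += 1                     # only parentheses may intervene
        if j == len(children) or children[j][0] != "String":
            raise SyntaxError("stuff must be followed by a literal string")
        out.append(("Stuff_And_String", children[j], children[i + 1:j]))
        i = j + 1                      # resume after the consumed literal
    return out


if __name__ == "__main__":
    item = [("String", "Data:"), ("OpenParenthesis", 0), ("Stuff",),
            ("CloseParenthesis", 0), ("String", "\n")]
    print(merge_stuff(item))
```

Keeping the intervening parenthesis nodes inside the merged node preserves the information needed later to record where a parenthesized field begins and ends.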

Code Generation Step 2: Substitution And Variable Renaming 

Having completed pre-processing, nonterminals on the 
RHS of the top-level rule are preferably substituted and any 
action statement variables renumbered. Substitution and 
variable renumbering preferably use a process having two 
functions, EXPAND_REGEXP and EXPAND_ACTION, 
and a global variable PAREN_COUNT. EXPAND_REGEXP 
is called recursively to substitute for RHS 
nonterminals, starting with a call to the top-level 
nonterminal. PAREN_COUNT is initialized to 0, and as substitution 
proceeds, it is incremented for each "(" encountered. 
PAREN_COUNT thus has a count of the total number of 
"("s encountered so far. Then during substitution, 
parentheses encountered are renumbered from their prior number to 
the current value of PAREN_COUNT and variables in 
action statements are renumbered in a corresponding 
fashion. For example, if PAREN_COUNT is currently 8, an 
OpenParenthesis<2> node is replaced by an 
OpenParenthesis<8> node, and an output($2) statement is replaced by an 
output($8) statement. EXPAND_ACTION performs action 
statement variable renumbering. These processes are 
preferably performed according to the following: 
GLOBAL VARIABLE integer PAREN_COUNT; 

FUNCTION EXPAND_REGEXP ( <nt>: nonterminal) : RETURNS a 
    parse tree with renumbered variables 
  /* an array of integers with new variable numbers */ 
  LOCAL VARIABLE new_names[] 
  LET T = parse tree for the RHS of rule for <nt> 
  PERFORM a PREORDER traversal of T. 
  At each node v of tree T 
    IF (v=OpenParenthesis<i>) THEN 
      new_names[i] = PAREN_COUNT; 
      REPLACE node v with 
        OpenParenthesis<PAREN_COUNT> 
      PAREN_COUNT = PAREN_COUNT + 1; 
    ELSE IF (v=CloseParenthesis<i>) THEN 
      REPLACE node v with 
        CloseParenthesis<new_names[i]> 
    ELSE IF (v=Nonterminal<x>) THEN 
      REPLACE node v with EXPAND_REGEXP(x) 
    ELSE IF (v=Action<x>) THEN 
      REPLACE node v with EXPAND_ACTION(x, 
        new_names[]) 

FUNCTION EXPAND_ACTION (T : parse tree T for action 
    language statement; new_names[]) : RETURNS a parse 
    tree with renumbered variables 
  PERFORM a PREORDER traversal of the parse tree T 
  At each node v, 
    IF (v denotes variable name $i (i an integer)) 
    THEN 
      REPLACE node v with a new node with $i 
        replaced by $(new_names[i]). 

Code Generation Part 3: Creating A Finite State Automata 

The final step in the preferred embodiment of code 
generation creates an FSA representing the substituted and 
processed regular expression. Although algorithms for 
creating such FSAs are known, they have not in the past 
provided facilities, for example, for embedding action 
language statements in such FSAs. Accordingly, the following 
description focuses on those new features of the preferred 
embodiment of this process that are directed to supporting 
action language statements and the other novel features of 
the reg-exp language. 

An FSA output from this process starts its execution in an 
initial state with an input pointer set to the start of the input 
string and then repeatedly executes one of six basic 
procedures according to its current state. These procedures set a 
new current state and typically affect one or more of the data 
structures accessed by the FSA. The six types of states and 
their procedures are as follows: 

(1) Char_branch<next_state_0, . . . , next_state_255>: 
When the current state is a char_branch state, the next state 
is selected according to the ASCII value, i, of the current 
input character as next_state_i, and the input pointer is 
advanced to the next character in the input string. 

(2) Marker<num, open/close_flag, offset, next_state>: 
When the current state is a marker state, a record is pushed 
on a run-time stack known as the mark stack that contains 
the character position of a parenthesis encountered in the 
input string plus the indicated offset and an indication of 
whether this parenthesis is open or close, and the next state 
is set to next_state. An exemplary mark can push a record 
indicating that the fifth open parenthesis occurred at the 
current input position. 

(3) Action<compiled_action_statements, next_state>: 
When the current state is an action state, those action 
language statements represented by 
compiled_action_statements are executed and the next state is set to 
next_state. 

(4) Success<>: When the current state is the success state, 
a string has been successfully matched and this FSA 
terminates. 

(5) Push<state, next_state>: When the current state is a 
push state, the current configuration of the FSA is pushed 
onto a run-time stack known as the backtracking stack, and 
the next state is set to next_state. An FSA configuration is 
a record containing identification of at least the current state 
together with the current position in the input string. Push is 
used to support the non-deterministic constructs in the 
regular expression language. An exemplary non-deterministic 
construct is the disjunction "reg-expr1|reg-expr2," 
which is matched by an FSA that, first, pushes a 
backtracking state onto the backtracking stack; second, 
attempts to run the FSA corresponding to reg-expr1; and, third, 
if that fails, backtracks and attempts to run the finite state 
machine corresponding to reg-expr2. 

(6) Failure<>: When the current state is the failure state, 
a backtrack record is popped off the backtracking stack and 
the next state and current input pointer are set to the contents 
of the backtrack record. If the backtracking stack is empty, 
the FSA has failed to match the input string and terminates. 

Creation of the FSA for an input regular expression is 
done using a postorder traversal of the previously produced 
parse tree. As the parse tree is traversed, to each node v is 
attached a finite state machine that matches the sub- 
expression represented by the sub-tree rooted at v. This 
attached finite state machine is the value of the variable 
v.machine. The final FSA output is the FSA that is attached 
to the root of the parse tree when the creation process 
completes. 

During the traversal, certain finite state machines are 
created in accordance with methods already known to those 
of skill in the art. These machines recognize standard regular 
expression constructs and can be constructed by known 
methods. General references for the creation of such 
standard finite state machines include the following: Aho et al., 
1974, The Design And Analysis Of Computer Algorithms, 
Addison Wesley Publishing Co., chapter 9; Aho et al., 1986, 
Compilers: Principles, Techniques, and Tools, Addison 
Wesley Publishing Co., chapter 3; Hopcroft et al., 1979, 
Introduction to Automata Theory, Languages, and Computation, 
Addison-Wesley Publishing Co., Section 2.5; Sedgewick, 
1990, Algorithms in C, Addison-Wesley Publishing Co., 
chapter 16. 

Further, the disclosed process adopts certain construction 
methods that can advantageously be changed in alternative 
embodiments. For example, the machine representing a literal 
string can be constructed according to the Boyer-Moore or 
other string matching algorithm. See, e.g., Sedgewick at 
chapter 19. 

The preferred process is: 



PERFORM a POSTORDER traversal of the tree. 
At each node v, DO: 
  IF (v=OpenParenthesis<n>) THEN 
    SET v.machine = Marker<n, "open", 0, Success<>> 
  ELSE IF (v=CloseParenthesis<n>) THEN 
    SET v.machine = Marker<n, "close", 0, Success<>> 
  ELSE IF (v=Action<node_for_the_action_language_statements>) THEN 
    SET v.machine = Action<compiled_action_statements, Success<>> 
  ELSE IF (v=String<literal_string>) THEN 
    SET len = length(the-string) 
    FOR i = 1 to len DO 
      branch_state[i] = Char_branch<s_0, s_1, . . . , s_255>, 
        where all the s_j's are Failure<> EXCEPT the s_j 
        where j is the ASCII code for the i'th character 
        of the-string, which is Success<> 
    SET v.machine = MAKE_SEQUENCE 
      (branch_state[1], . . . , branch_state[len]) 
  ELSE IF (v=Stuff_And_String<the-string, par_1, . . . , par_k>) THEN 
    /* m_1 can be constructed according to known 
       methods disclosed in the previously cited 
       references */ 
    SET m_1 = a finite state machine that scans 
      the input string for the first occurrence of 
      the-string and stops as soon as it comes to 
      the end of this first occurrence. 
    FOR i = 1 to k DO 
      SET par_i.machine = marker state 
        appropriate to parenthesis par_i and 
        having an offset of -length(the-string). 
    SET v.machine = MAKE_SEQUENCE (m_1, 
      par_1.machine, . . . , par_k.machine) 
  ELSE IF (v=Sequence<element_1, . . . , element_k>) THEN 
    FOR i = 1 to k DO 
      SET element_i.machine = machine 
        constructed according to this 
        process for element_i 
    SET v.machine = MAKE_SEQUENCE 
      (element_1.machine, . . . , element_k.machine) 
  /* a disjunction, "|" */ 
  ELSE IF (v=Alternatives<element_1, . . . , element_k>) THEN 
    FOR i = 1 to k DO 
      SET element_i.machine = machine 
        constructed according to this 
        process for element_i 
    SET v.machine = element_k.machine 
    FOR i = k-1 down to 1 DO 
      v.machine = Push<v.machine, element_i.machine> 
  /* type is "?", "+", or "*" */ 
  ELSE IF (v=Repetition<type, element>) THEN 
    SET element.machine = machine constructed 
      according to this process for element 
    SET v.machine = a finite state machine 
      constructed according to methods 
      disclosed in the previously cited 
      references for this type of repetition 
      of element.machine 
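The Alternatives case above folds k sub-machines into a chain of Push states. The following is a minimal Python sketch of that fold only. The `Push` record, `make_alternatives`, and `try_order` are illustrative names that do not appear in this description, and plain strings stand in for real sub-machines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Push:
    """Push<state, next_state>: save `state` for backtracking, continue at `next_state`."""
    state: object
    next_state: object


def make_alternatives(machines):
    """Fold k sub-machines into nested Push states, mirroring the Alternatives case."""
    machine = machines[-1]                      # v.machine = element_k.machine
    for element in reversed(machines[:-1]):     # i = k-1 down to 1
        machine = Push(state=machine, next_state=element)
    return machine


def try_order(machine):
    """Order in which the alternatives would be attempted at run time."""
    order = []
    while isinstance(machine, Push):
        order.append(machine.next_state)
        machine = machine.state
    order.append(machine)
    return order
```

Under these assumptions, `make_alternatives(["m1", "m2", "m3"])` yields a Push that tries m1 first and keeps a nested Push for m2 and m3 as the saved alternative, so the alternatives are attempted left to right.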



This process uses a MAKE_SEQUENCE function that 
builds a composite machine from a series of sub-machines. 
The composite machine executes each of the sub-machines 
in sequence until one sub-machine fails, in which case the 
composite machine also fails, or all the sub-machines 
succeed, in which case the composite machine succeeds. In 
other words, the composite machine runs the first 
sub-machine; if that succeeds, it runs the second; if that 
succeeds, it runs the third, and so on. If any sub-machine 
fails, i.e., reaches a Failure<> state, then the composite machine 
also fails. 



FUNCTION MAKE_SEQUENCE (machine m_1, m_2, . . . , m_k) 
    RETURNS machine. 
  new_m = m_1; 
  FOR i = 2 to k DO 
    Traverse the states of new_m and replace 
    any Success<> states with m_i 
  RETURN new_m 
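As an illustration of MAKE_SEQUENCE, the following Python sketch chains single-character branch states into a machine for a literal string, as in the String case of the construction process. It is a sketch under stated assumptions, not the disclosed implementation: the class names, `char_state`, `reachable`, `literal`, and `matches_prefix` are hypothetical, states are mutable objects, characters are assumed to have codes below 256, and sub-machines are assumed not to be shared between sequences (rewiring mutates them in place).

```python
class Success:
    """Terminal state: the (sub-)machine has matched."""
    def successors(self):
        return []


class Failure:
    """Terminal state: the (sub-)machine has failed."""
    def successors(self):
        return []


class CharBranch:
    """One successor per character code, as in Char_branch<s_0, ..., s_255>."""
    def __init__(self, table):
        self.table = table

    def successors(self):
        return self.table


SUCCESS, FAILURE = Success(), Failure()


def char_state(ch):
    """Branch state that succeeds only on `ch` (assumes code points below 256)."""
    table = [FAILURE] * 256
    table[ord(ch)] = SUCCESS
    return CharBranch(table)


def reachable(start):
    """All states reachable from `start`, each listed once."""
    seen, todo, found = set(), [start], []
    while todo:
        state = todo.pop()
        if id(state) not in seen:
            seen.add(id(state))
            found.append(state)
            todo.extend(state.successors())
    return found


def make_sequence(*machines):
    """Chain machines by redirecting every Success edge of the prefix to the next one."""
    new_m = machines[0]
    for m_i in machines[1:]:
        if new_m is SUCCESS:          # an empty prefix is simply replaced
            new_m = m_i
            continue
        for state in reachable(new_m):  # state list is computed before rewiring
            if isinstance(state, CharBranch):
                state.table = [m_i if t is SUCCESS else t for t in state.table]
    return new_m


def literal(text):
    """Machine for a non-empty literal string, built as in the String case."""
    return make_sequence(*(char_state(c) for c in text))


def matches_prefix(machine, text):
    """Run a branch-only machine; True if it reaches Success on a prefix of `text`."""
    state, pos = machine, 0
    while isinstance(state, CharBranch):
        if pos == len(text):
            return False
        state = state.table[ord(text[pos])]
        pos += 1
    return state is SUCCESS
```

For instance, `matches_prefix(literal("cat"), "catalog")` is true because the machine reaches Success after consuming "cat", while `"car"` and `"ca"` both fail.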



This concludes the preferred construction of FSAs 
representing regular expressions from the reg-exp language of 
this invention. Next, the FSA execution is described. 
Run-Time Interpretation of Regular Expressions 

In a preferred embodiment, the FSA representing a regular 
expression is executed by interpreting the structure created 
in the previous code generation steps. The interpreter 
preferably accesses several data structures including the current 
state of the FSA, a pointer to the current character position 
of the input string, a backtracking stack and a marker stack. 
The backtracking stack contains records characterizing the 
state of interpretation of the FSA and is used in a manner 
known in the art to implement backtracking, in case an 
attempted partial match of the input string fails. Configura- 
tion records include the current state and the current input 
position. In an alternative embodiment, push states can be 
avoided by implementing states with a list of possible next 
states. 

The preferred implementation of the novel variable bind- 
ing and action statement features of the reg-exp language of 
this invention utilizes an additional stack, the mark stack. 
The semantics of these features require that no actions be 
executed until the entire regular expression matching 
process succeeds. Upon success, all actions are then executed 
with the variable bindings that occurred during parsing. This 
means that a single variable, e.g. $1, can have different 
bindings in the same action statement which can be executed 
multiple times on match success. 

These semantics are implemented in the preferred 
embodiment using the mark stack in the following manner. When 
the current interpreted state is an Action<> state, the action 
language statement is not immediately executed. Rather, the 
action statement code is pushed onto the mark stack. 
Similarly, when the current state is a mark state, which 
represents a parenthesis in the regular expression definition, 
the information present in the mark state is pushed onto the 
mark stack. This information permits finding the position in 
the input string of the beginning or ending of a current 
variable binding. To prevent execution of action language 
statements encountered during a failed attempted partial 
match, the mark stack is popped to an appropriate level 
when the FSA backtracks. Thus, the machine configuration 
and the configuration record on the backtracking stack 
further include the current position in the mark stack. 

Action language statements are executed upon final match 
success by scanning the mark stack from bottom to top. In 
this scan, variable bindings are set as indicated by mark state 
records and actions are executed as indicated by action state 
records. In case of match failure, no action language state- 
ments are executed. 

In more detail, an embodiment of the preferred imple- 
mentation functions as follows: 



GLOBAL VARIABLE current_state = initialized to the 
    initial machine state 
GLOBAL VARIABLE input_pos = initialized to the address 
    of the beginning of the input string 
GLOBAL VARIABLE BT_STACK = initialized to an empty stack 
GLOBAL VARIABLE MARK_STACK = initialized to an empty 
    stack 

REPEAT UNTIL one of the following clauses exits 
  CASE (current_state) IS 
    Char_branch<state_0, . . . , state_255>: 
      current_state = state_(ASCII value of the 
        current input character) 
      input_pos = input_pos + 1 
    Marker<num, open/close_flag, offset, nextstate>: 
      push ("mark", num, open/close_flag, 
        input_pos+offset) onto MARK_STACK 
      current_state = nextstate 
    Action<compiled_action_code, nextstate>: 
      push ("action", compiled_action_code) onto 
        MARK_STACK 
      current_state = nextstate 
    Push<state, nextstate>: 
      push (state, input_pos, height(MARK_STACK)) 
        onto BT_STACK 
      current_state = nextstate 
    /* exit the REPEAT statement on final success */ 
    Success<>: GOTO done 
    Failure<>: 
      /* match finally fails if backtracking is 
         attempted and the backtrack stack is empty */ 
      IF BT_STACK is empty THEN 
        RETURN "fail" 
      ELSE 
        pop BT_STACK record into variables st, 
          ip, msh 
        current_state = st 
        input_pos = ip 
        pop MARK_STACK down to height msh 
  END CASE 

/* when parsing succeeds, scan MARK_STACK and execute 
   actions with variable bindings indicated by the marks */ 
done: 
  LOCAL VARIABLE open_marks[], close_marks[] 
  FOR element = bottom of MARK_STACK up to top of 
      MARK_STACK DO: 
    IF (element=("mark",num,"open",pos)) THEN 
      SET open_marks[num] = pos 
    IF (element=("mark",num,"close",pos)) THEN 
      SET close_marks[num] = pos 
    IF (element=("action",compiled_action_code)) THEN 
      SET values of $0, $1, . . . , $n to the positions 
        in the input string indicated by 
        open_marks[n] and close_marks[n] 
      RUN compiled_action_code using these 
        bindings of $0, $1, . . . , $n 
  END FOR 
  /* match succeeded and all actions executed */ 
  RETURN "succeed" 

For example, in the preceding FOR loop, the binding of $2 
is set to the sub-string of the input string beginning at 
open_marks[2] and ending at close_marks[2]. 
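The interpreter loop above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the disclosed implementation: states are assumed to be tuples stored in a dictionary keyed by hypothetical state names, a character-to-state dictionary with an implicit failure default stands in for the 256-entry branch table, running off the end of the input is assumed to fail, and an action is represented by a Python callable that receives the variable bindings.

```python
def run_fsa(states, start, text):
    """Interpret `states` on `text`; return True on a match, False otherwise."""
    current, pos = start, 0
    bt_stack, mark_stack = [], []
    while True:
        state = states[current]
        kind = state[0]
        if kind == "char":
            # Simplification: dict lookup replaces the 256-entry table.
            ch = text[pos] if pos < len(text) else None
            current, pos = state[1].get(ch, "FAIL"), pos + 1
        elif kind == "mark":
            _, num, flag, offset, nxt = state
            mark_stack.append(("mark", num, flag, pos + offset))
            current = nxt
        elif kind == "action":
            mark_stack.append(("action", state[1]))   # deferred, not run yet
            current = state[2]
        elif kind == "push":
            bt_stack.append((state[1], pos, len(mark_stack)))
            current = state[2]
        elif kind == "success":
            break
        else:  # "fail": backtrack, or give up when nothing is left to try
            if not bt_stack:
                return False
            current, pos, height = bt_stack.pop()
            del mark_stack[height:]               # discard marks of the failed path
    # Match succeeded: replay the mark stack from bottom to top.
    open_marks, close_marks = {}, {}
    for record in mark_stack:
        if record[0] == "mark":
            _, num, flag, where = record
            (open_marks if flag == "open" else close_marks)[num] = where
        else:
            bindings = {n: text[open_marks[n]:close_marks[n]]
                        for n in open_marks if n in close_marks}
            record[1](bindings)
    return True


emitted = []


def emit(bindings):
    """Stand-in for a compiled action: record the current binding of $1."""
    emitted.append(bindings[1])


# Hypothetical machine for:  (ab) {emit($1)} c  |  (a) {emit($1)} bd
STATES = {
    "S":  ("push", "B0", "A0"),
    "A0": ("mark", 1, "open", 0, "A1"),
    "A1": ("char", {"a": "A2"}),
    "A2": ("char", {"b": "A3"}),
    "A3": ("mark", 1, "close", 0, "A4"),
    "A4": ("action", emit, "A5"),
    "A5": ("char", {"c": "OK"}),
    "B0": ("mark", 1, "open", 0, "B1"),
    "B1": ("char", {"a": "B2"}),
    "B2": ("mark", 1, "close", 0, "B3"),
    "B3": ("action", emit, "B4"),
    "B4": ("char", {"b": "B5"}),
    "B5": ("char", {"d": "OK"}),
    "OK": ("success",),
    "FAIL": ("fail",),
}
```

On the input "abd", the first alternative records its action and then fails at "c"; backtracking pops the mark stack back to its saved height, so only the action of the second alternative runs, with $1 bound to "a". On "abx" neither alternative matches and no action runs at all.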

Action Language Code Generation and Run-Time Interpre- 
tation 

In a preferred embodiment, action language statements 
are compiled into a variable length bytecode-type of 
intermediate language that is executed at run-time by a bytecode 
interpreter. The previously described action language syntax 
is parsed by any appropriate known parsing technique, for 
example by LL or LR parsing. Adequate parsing techniques 
are presented with examples in Aho et al., 1986, Compilers: 
Principles, Techniques, and Tools, Addison Wesley Publishing 
Co., chapter 4. During parsing all variables occurring in 
the wrappers are assigned unique numbers. A preferred such 
assignment assigns sequential positive integers to named 
variables, e.g., $x, $abc, in the order they are encountered, 
and assigns sequential negative integers to the numeric 
variables denoting matched sub-strings, e.g., $0, $1, $2, etc., 
the numeric variable $i being assigned the number -(i+1). 
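The numbering scheme just described can be illustrated with a short Python sketch. The function name `number_variables` is hypothetical, and a regular-expression scan of the source text is assumed here as a stand-in for the real parser, which would assign the numbers while building the parse tree.

```python
import re


def number_variables(source):
    """Map each $-variable in `source` to its integer code.

    Named variables get 1, 2, 3, ... in order of first appearance;
    the numeric variable $i gets -(i + 1).
    """
    codes = {}
    next_named = 1
    for name in re.findall(r"\$(\w+)", source):
        key = "$" + name
        if key in codes:
            continue                      # already numbered
        if name.isdigit():
            codes[key] = -(int(name) + 1)
        else:
            codes[key] = next_named
            next_named += 1
    return codes
```

For the statement text `$x = $1 + $abc; output($0, $x)` this assigns 1 to $x, 2 to $abc, -2 to $1, and -1 to $0, so named and numeric variables can never collide.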

Intermediate code is generated by known syntax-directed 
translation techniques. Adequate syntax-directed translation 
techniques are presented with examples in Aho et al. at 
chapter 8. The preferred intermediate code is presented 
below. In this presentation, intermediate language 
instructions have the format of an instruction code, with optional 
modifying information, followed by zero or more 
arguments. Variables are denoted by <var> where var is the 
2-byte integer coding of the variable. Relative branch offsets 
are encoded as 4-byte integers. All branches are relative to 
the current instruction address, i.e., the address of the branch 
instruction itself, so the bytecode is relocatable. 
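To make the field widths and the relocation property concrete, the following Python sketch packs and decodes branch instructions. Several details are assumptions, since the text does not state them: a 1-byte instruction code, little-endian byte order, a signed 4-byte offset, and a signed 2-byte variable code. The function names are hypothetical; the numeric codes 1, 2, and 3 are the branch codes given in the instruction list of this description.

```python
import struct

# Assumed layout (not stated in the text): 1-byte instruction code,
# little-endian fields, signed 4-byte offset, signed 2-byte variable code.
BRANCH_IF_TRUE, BRANCH_IF_FALSE, BRANCH_ALWAYS = 1, 2, 3


def encode_branch(code, offset, var=None):
    """Pack a branch instruction; conditional branches also carry a variable."""
    if code == BRANCH_ALWAYS:
        return struct.pack("<Bi", code, offset)
    return struct.pack("<Bih", code, offset, var)


def branch_target(bytecode, address):
    """Absolute target of the branch instruction stored at `address`."""
    (offset,) = struct.unpack_from("<i", bytecode, address + 1)
    return address + offset
```

Because the offset is measured from the branch instruction itself, the same encoded bytes can be placed at any address: a branch with offset 12 lands at 12 when stored at address 0 and at 19 when stored at address 7, with no patching of the instruction.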
1. Function Calls: 
Encoding: 0 funcode numargs <var_0> <var_1> . . . 
<var_numargs> 
Meaning: var_0 is assigned the result of the function 
identified by "funcode" on the arguments var_1, . . . , 
var_numargs of number "numargs." Each built-in 
function, e.g., argc, argv, etc., and each operator, e.g., 
"+", etc., of the language has its own unique 
integer "funcode". 

2. Branch Instructions: 
Encoding: 1 <offset> <var0> 
Meaning: if var0 is true, branch by amount offset relative 
to the branch instruction 
Encoding: 2 <offset> <var0> 
Meaning: if var0 is false, branch by amount offset relative 
to the branch instruction 
Encoding: 3 <offset> 
Meaning: always branch by amount offset relative to the 
branch instruction 

3. Load Constant Instructions: 
Encoding: 4 <var0> <null-terminated-string> 
Meaning: var0 is assigned the null-terminated-string 
constant 
Encoding: 5 <var0> <floating-point-constant> 
Meaning: var0 is assigned the floating-point-constant 

4. Move Instructions: 
Encoding: 6 <var0> <var1> 
Meaning: var0 is copied from var1 

5. Output Instructions: 
Encoding: 7 <var0> 
Meaning: outputs one item var0 to the current tuple 
Encoding: 8 
Meaning: terminates the current tuple, causes it to be 
returned to the netbot 

6. Parsing Instructions: 
Encoding: 9 <var0> <var1> <address of the finite state 
machine for <nt>> 
Meaning: var0 is assigned the result of the parse of var1 
according to the regular expression denoted by <nt> 

7. Exit Instructions: 
Encoding: 10 
Meaning: exits the current action language block 
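One way to read the instruction list above is as the body of a fetch-and-dispatch loop. The Python sketch below interprets a subset of the codes (1 to 8 and 10) under simplifying assumptions that are not part of this description: instructions are already decoded into tuples, branch offsets count instructions rather than bytes, unassigned variables are represented by None and treated as false, and completed tuples are collected in a list instead of being returned to the netbot. Function calls (code 0) and parsing (code 9) are omitted.

```python
def run_bytecode(program, variables=None):
    """Execute decoded instructions; return the list of completed output tuples."""
    variables = dict(variables or {})   # variable code -> value (None if unassigned)
    tuples, current = [], []
    pc = 0                              # index of the current instruction
    while True:
        op, *args = program[pc]
        if op in (1, 2):                # branch if true / branch if false
            offset, var = args
            taken = bool(variables.get(var)) == (op == 1)
            pc += offset if taken else 1
        elif op == 3:                   # unconditional branch
            pc += args[0]
        elif op in (4, 5):              # load string / floating-point constant
            variables[args[0]] = args[1]
            pc += 1
        elif op == 6:                   # move: var0 is copied from var1
            variables[args[0]] = variables.get(args[1])
            pc += 1
        elif op == 7:                   # output one item to the current tuple
            current.append(variables.get(args[0]))
            pc += 1
        elif op == 8:                   # terminate the current tuple
            tuples.append(tuple(current))
            current = []
            pc += 1
        elif op == 10:                  # exit the action language block
            return tuples
        else:
            raise NotImplementedError(f"instruction code {op} is not sketched here")


# Hypothetical program: output ("widget", 9.5) only if variable 3 is true.
PROGRAM = [
    (4, 1, "widget"),   # 0: var 1 = "widget"
    (5, 2, 9.5),        # 1: var 2 = 9.5
    (2, 4, 3),          # 2: if var 3 is false, branch +4 to the exit
    (7, 1),             # 3: output var 1
    (7, 2),             # 4: output var 2
    (8,),               # 5: terminate the tuple
    (10,),              # 6: exit
]
```

With variable 3 set to a true value, `run_bytecode(PROGRAM, {3: True})` produces the single tuple ("widget", 9.5); with it false or unassigned, the branch at index 2 skips straight to the exit and no tuple is produced.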
In alternative embodiments, different intermediate codes 
or even no intermediate code can be used. For example, 
instead of the previously disclosed address-based intermediate 
code, a reverse Polish stack-based intermediate code could 
equivalently be used. Optionally, no intermediate code is 
used, and action language statements are compiled directly 
to a machine language. Both of these alternatives can be 
implemented by using the disclosed syntax-directed methods. See, 
e.g., Aho et al. 

Execution of wrapper actions is performed in a preferred 
embodiment in several steps. First, memory is allocated to 
store all variables and variable values. The number of 
variables is known after the action statements have been 
parsed. All variables are initialized to an unassigned value. 
Next, a pointer CURRENT_INSTRUCTION is initialized 
to point to the first instruction of the bytecode to be 
executed. Then, the interpreter enters a loop which retrieves 
the bytecode pointed to by CURRENT_INSTRUCTION, 
performs the coded simple action according to the bytecode 
language previously described, and updates 
CURRENT_INSTRUCTION either to point to the next instruction or, for 
taken branch instructions, by adding the offset value. 
The interpreter finishes execution upon encountering an exit 
instruction. 

The parse built-in function is executed at run-time by 
calling the FSA interpreter to interpret the compiled regular 
expression code specifying the finite state machine. 
output(a_1, . . . , a_n) statements are translated and then executed 
by executing bytecode instruction code 7 for each variable 
in the output list followed by a bytecode instruction code 8 
that terminates the current tuple, i.e., marks the end of the 
output statement. 

6. SPECIFIC EMBODIMENTS. CITATION OF 
REFERENCES 

The present invention is not to be limited in scope by the 
specific embodiments described herein. Indeed, various 
modifications of the invention in addition to those described 
herein will become apparent to those skilled in the art from 
the foregoing description and accompanying figures. Such 
modifications are intended to fall within the scope of the 
appended claims. 

Various publications are cited herein, the disclosures of 
which are incorporated by reference in their entireties. 
What is claimed is: 

1. A wrapper description language method comprising 
recognizing a string by scanning the string as a regular 
expression written in a regular expression language, the 
regular expression language capable of specifying an entire 
regular expression in terms of component regular expressions 
such that, upon recognition of one component regular 
expression of the entire regular expression, actions can be 
executed with variables bound as of the time of recognition 
of the component regular expression, the actions actually 
being executed only if the entire regular expression is also 
recognized. 

2. The wrapper description language method of claim 1 
further comprising the step of executing one or more 
statements written in an action language, the action language 
capable of specifying the actions of operators on variables, 
the actions of functions on variables, and of the sequence of 
statements, the statements of the action language being 
further capable of being embedded in the regular expression 
language for specifying the actions occurring on the 
recognition of the component regular expressions. 

3. A computer readable medium containing computer 
executable instructions for causing one or more computers to 
perform a wrapper description language method comprising 
recognizing a string by scanning the string as a regular 
expression written in a regular expression language, the 
regular expression language capable of specifying an entire 
regular expression in terms of component regular 
expressions such that, upon recognition of one component regular 
expression of the entire regular expression, actions can be 
executed with variables bound as of the time of recognition 
of the component regular expression, the actions actually 
being executed only if the entire regular expression is also 
recognized. 

4. A computer program product stored on a computer 
readable medium, and including computer executable 
instructions for causing one or more computers to interpret 
strings written in a regular expression language by 
performing the steps of: 

receiving a string written in the regular expression 
language; 

scanning the string to recognize a plurality of component 
regular expressions; 

binding variables defined in the component regular 
expressions as the component regular expressions are 
recognized; and 

upon recognition of an entire regular expression 
containing the plurality of recognized component regular 
expressions, executing at least one action using the 
bound variables from the component regular 
expressions. 